Abstract 

A central question for active learning (AL) is: “what is the optimal selection?” Defining 
optimality by classifier loss produces a new characterisation of optimal AL behaviour, by 
treating expected loss reduction as a statistical target for estimation. 

This target forms the basis of model retraining improvement (MRI), a novel approach 
providing a statistical estimation framework for AL. This framework is constructed to 
address the central question of AL optimality, and to motivate the design of estimation 
algorithms. 

MRI allows the exploration of optimal AL behaviour, and the examination of AL heuristics, 
showing precisely how they make sub-optimal selections. The abstract formulation of 
MRI is used to provide a new guarantee for AL, that an unbiased MRI estimator should 
outperform random selection. 

This MRI framework reveals intricate estimation issues that in turn motivate the construction 
of new statistical AL algorithms. One new algorithm in particular performs 
strongly in a large-scale experimental study, compared to standard AL methods. This 
competitive performance suggests that practical efforts to minimise estimation bias may 
be important for AL applications. 

Keywords: active learning, model retraining improvement, estimation framework, expected 
loss reduction, classification 

1. Introduction 

Classification is a central task in statistical inference and machine learning. In certain cases 
unlabelled data is plentiful, and a subset can be queried for labelling. Active learning (AL) 
seeks to intelligently select this subset of unlabelled examples, to improve a base classifier. 
Examples include medical image diagnosis and document categorisation (Dasgupta and 
Hsu, 2008; Hoi et al., 2006). AL approaches are reviewed by Settles (2009) and Olsson 
(2009). AL method performance is often assessed by large-scale experimental studies such 
as Guyon et al. (2011), Kumar et al. (2010) and Evans et al. (2013). 


A prototypical AL scenario consists of a classification problem and a classifier trained 
on a small labelled dataset. The classifier may be improved by retraining with further 
examples, systematically selected from a large unlabelled pool. This formulation of AL 
raises the central question for AL, “what is the optimal selection?” 


Performance in classification is judged by loss functions such as those described in Hand 
(1997), suggesting that optimality in AL selection should be characterised in terms of 
classifier loss. The optimal selection is therefore defined as the example that 
maximises the expected loss reduction. This statistical quantity forms the basis of model 
retraining improvement (MRI), a novel statistical framework for AL. Compared to heuristic 
methods, a statistical approach provides strong advantages, both theoretical and practical, 
described below. 

This MRI estimation framework addresses the central question by formally defining 
optimal AL behaviour. Creating a mathematical abstraction of optimal AL behaviour allows 
reasoning about heuristics, e.g. showing precisely how they make sub-optimal choices in 
particular contexts. Within this framework, an ideal unbiased MRI estimator is shown to 
have the property of outperforming random selection, which is a new guarantee for AL. 

Crucially, MRI motivates the development of novel algorithms that perform strongly 
compared to standard AL methods. MRI estimation requires a series of steps, which are 
subject to different types of estimation problem. Algorithms are constructed to approximate 
MRI, taking different estimation approaches. 

A large-scale experimental study evaluates the performance of the two new MRI estimation 
algorithms, alongside standard AL methods. The study explores many sources of 
variation: classifiers, AL algorithms, with real and abstract classification problems (both 
binary and multi-class). The results show that the MRI-motivated algorithms perform 
competitively in comparison to standard AL methods. 

This work is structured as follows: first the background of classification and AL is 
described in Section 2. Section 3 defines MRI, illustrated by an abstract classification 
problem in Section 3.2. MRI estimation algorithms are then described and evaluated 
in a large-scale experimental study, followed by concluding remarks. 


2. Background 

The background contexts of classification and AL are described, followed by a brief review 
of relevant literature, with particular focus on methods that are used later in the paper. 

2.1 Classification 

The categorical response variable $Y$ is modelled as a function of the covariates $X$. For the 
response $Y$ there are $k$ classes with class labels $\{c_1, c_2, \ldots, c_k\}$. Each classification example 
is denoted $(x, y)$, where $x$ is a $d$-dimensional covariate vector and $y$ is a class label. The 
class prior is denoted $\pi$. 

The Bayes classifier is an idealisation based on the true distributions of the classes, 
thereby producing optimal probability estimates, and class allocations given a loss function. 
Given a covariate vector $x$, the Bayes classifier outputs the class probability vector of 
$y|x$, denoted $p = (p_j)_{j=1}^{k}$. A probabilistic classifier estimates the class probability vector as 
$\hat{p} = (\hat{p}_j)_{j=1}^{k}$, and allocates $x$ to class $\hat{y}$ using decision theoretic arguments, often using a 
threshold. This allocation function is denoted $h$: $\hat{y} = h(\hat{p})$. For example, to minimise 
misclassification error, the most probable class is allocated: $\hat{y} = h(\hat{p}) = \arg\max_j(\hat{p}_j)$. The 
objective of classification is to learn an allocation rule with good generalisation properties. 

A somewhat non-standard notation is required to support this work, which stresses the 
dependence of the classifier on the training data. A dataset is a set of examples, denoted 
$D = \{(x_i, y_i)\}_{i=1}^{n}$, where $i$ indexes the example. This indexing notation will be useful later. 

A dataset $D$ may be subdivided into training data $D_T$ and test data $D_E$. This dataset 
division may be represented by index sets, for example, $T \cup E = \{1, \ldots, n\}$, showing the 
data division into training and test subsets. 

First consider a parametric classifier, for example linear discriminant analysis or logistic 
regression (Bishop, 2007, Chapter 4). A parametric classifier has estimated parameters $\hat{\theta}$, 
which can be regarded as a fixed length vector (fixed given $d$ and $k$). These parameters 
are estimated by model fitting, using the training data: $\hat{\theta} = \theta(D_T)$, where $\theta()$ is the model 
fitting function. This notation is intended to emphasize the dependence of the estimated 
parameters $\hat{\theta}$ on the training data $D_T$. 

Second, this notation is slightly abused to extend to non-parametric classifiers. The 
complexity of non-parametric classifiers may increase with sample size, hence they cannot 
be represented by a fixed length object. In this case $\hat{\theta}$ becomes a variable-length object 
containing the classifier’s internal data (for example the nodes of a decision tree, or the 
stored examples of $K$-nearest-neighbours). 

While the contents and meaning of $\hat{\theta}$ would be very different, the classifier’s functional 
roles are identical: model training produces $\hat{\theta}$, which is used to predict class probabilities. 
This probability prediction is denoted $\hat{p} = \phi(\hat{\theta}, x)$. These predictions are in turn used to 
assess classifier performance. 

To consider classifier performance, first assume a fixed training dataset $D_T$. Classifier 
performance is assessed by a loss function, for example error rate, which quantifies the 
disagreement between the classifier’s predictions and the truth. The empirical loss for a 
single example is defined via a loss function $g(y, \hat{p})$. Many loss functions focus on the 
allocated class, for example error rate, $g_e(y, \hat{p}) = 1(y \neq h(\hat{p}))$. Other loss functions focus 
on the predicted probability, for example log loss, $g_o(y, \hat{p}) = -\sum_{j=1}^{k} 1(y = c_j) \log \hat{p}_j$. 

The estimated probabilities $\hat{p}$ are highly dependent on the estimated classifier $\hat{\theta}$. To 
emphasize that dependence, the empirical loss for a single example is denoted 
$M(\hat{\theta}, x, y) = g(y, \hat{p})$. For example, error rate empirical loss is denoted $M_e(\hat{\theta}, x, y)$. 

In classification, generalisation performance is a critical quantity. For this reason, empirical 
loss is generalised to expected loss, denoted $L(\hat{\theta})$: 

$$L(\hat{\theta}) = E_{X,Y}[M(\hat{\theta}, x, y)] = E_{X} E_{Y|X}[M(\hat{\theta}, x, y)].$$ 

This expected loss $L$ is defined as an expectation over all possible test data, given a specific 
training set. The expected error rate and log loss are denoted $L_e$ and $L_o$. Hereafter loss 
will always refer to the expected loss $L$. The loss $L(\hat{\theta})$ is dependent on the data $D$ used to 
train the classifier, emphasized by rewriting $L(\hat{\theta})$ as $L(\theta(D))$ since $\hat{\theta} = \theta(D)$. 
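
As an illustration of the loss notation, the following Python sketch writes the two single-example losses and a sample-average estimate of the expected loss. The function names, the use of integer class indices and the `predict` callable are illustrative assumptions rather than part of the original work, and the log loss shown is the standard single-example form.

```python
import math

def allocate(p_hat):
    """Allocation rule h: index of the most probable class."""
    return max(range(len(p_hat)), key=lambda j: p_hat[j])

def error_loss(y, p_hat):
    """Error-rate loss for one example: 1 if the allocated class differs from y."""
    return 0.0 if allocate(p_hat) == y else 1.0

def log_loss(y, p_hat, eps=1e-12):
    """Log loss for one example: negative log of the probability of the true class."""
    return -math.log(max(p_hat[y], eps))  # eps guards against log(0)

def estimate_expected_loss(predict, test_data, loss):
    """Approximate the expected loss L by averaging over held-out (x, y) pairs."""
    return sum(loss(y, predict(x)) for x, y in test_data) / len(test_data)
```

With a large held-out sample, the average returned by `estimate_expected_loss` approximates the expectation over test data described above; it is only an estimate, not the exact quantity $L$.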

The change in the loss as the number of labelled examples increases is of great methodological 
interest. This function is known as the learning curve, typically defined as the 
change of expected loss with the number of examples. Learning curves are illustrated in 
Figure 1 and discussed in Perlich et al. (2003), Gu et al. (2001) and Kadie (1995). 


2.2 Active Learning 

The context for AL is an abundance of unlabelled examples, with labelled data either 
expensive or scarce. Good introductions to AL are provided by Dasgupta (2011), Settles 
(2009) and Olsson (2009). 


An algorithm can select a few unlabelled examples to obtain their labels from an oracle 
(for example a human expert). This provides more labelled data which can be included 
in the training data, potentially improving a classifier. Intuitively some examples may be 
more informative than others, so systematic example selection should maximise classifier 
improvement. 

In pool-based AL, there is an unlabelled pool of data Xp from which examples may be 
selected for labelling. This pool provides a set of examples for label querying, and also gives 
further information on the distribution of the covariates. Usually there is also a (relatively 
small) initial dataset of labelled examples, denoted Dj, typically assumed to be iid in AL. 
This work considers the scenario of pool-based AL. 

In AL it is common to examine the learning curve, by repeating the AL selection step 
many times (iterated AL). At each selection step, the loss is recorded, and this generates 
a set of losses, which define the learning curve for the AL method. Iterated AL allows the 
exploration of performance over the learning curve, as the amount of labelled data grows. 
This repeated application of AL selection is common in both applications and experimental 
studies (Guyon et al., 2011; Evans et al., 2013). 

In contrast to iterated AL, the AL selection step may occur just once (single-step AL). 
The question of iterated or single-step AL is critical, because iterated AL inevitably produces 
covariate bias in the labelled data. The covariate bias from iterated AL creates a selection 
bias problem, which is intrinsic to AL. 

At each selection step, an AL method may select a single example from the pool (individual 
AL) or several examples at once (batch AL). AL applications are often constrained 
to use batch AL for pragmatic reasons (Settles, 2009). 


Turning to AL performance, consider random selection (RS) where examples are chosen 
randomly (with equal probability) from the pool. By contrast, AL methods select some 
examples in preference to others. Under RS and AL, the classifier receives exactly the same 
number of labelled examples; thus RS provides a reasonable benchmark for AL (Guyon 
et al., 2011; Evans et al., 2013). The comparison of methods to benchmarks is available in 
such experimental studies. 


Classifier performance should improve, at least on average, even under the benchmark 
RS, since the classifier receives more training data (an issue explored below). AL performance 
assessment should consider how much AL outperforms RS. Hence AL performance 
addresses the relative improvement of AL over RS, and the relative ranks of AL methods, 
rather than the absolute level of classifier performance. Figure 1 shows the losses of AL 
and RS as the number of labelled examples increases. 

Figure 1: Performance comparison of active learning and random selection, showing that a 
classifier often improves faster under AL than under RS. In both cases the loss decreases 
as the number of labelled examples increases; however, AL improves faster than RS. These 
curves are smoothed averages from multiple experiments. The black vertical line illustrates 
the fixed-label comparison, whereas the blue horizontal line shows the fixed-loss comparison 
(see Section 2.2). The classification problem is “Abalone” from UCI, a three-class problem, 
using classifier 5-nn, and Shannon entropy as the AL method. 

Figure 1 shows two different senses in which AL outperforms RS: first AL achieves better 
loss reduction for the same number of labels (fixed-label comparison), and second AL needs 
fewer labels to reach the same classifier performance (fixed-loss comparison). Together the 
fixed-label comparison and fixed-loss comparison form the two fundamental aspects of AL 
performance. The fixed-label comparison first fixes the number of labels, then seeks to 
minimise loss. Several established performance metrics focus on the fixed-label comparison: 
AUA, ALC and WI (Guyon et al., 2011; Evans et al., 2013). The fixed-label comparison is 
more common in applications where the costs of labelling are significant (Settles, 2009). 


Under the fixed-loss comparison, the desired level of classifier loss is fixed, the goal being 
to minimise the number of labels needed to reach that level. Label complexity is the classic 
example, where the desired loss level is a fixed ratio of asymptotic classifier performance 
(Dasgupta, 2011). Label complexity is often used as a performance metric in contexts where 
certain assumptions permit analytically tractable results, for example Dasgupta (2011). 


2.3 Overview of Active Learning Methods 

A popular AL approach is the uncertainty sampling heuristic, where examples are chosen 
with the greatest class uncertainty (Thrun and Moller, 1992; Settles, 2009). This approach 
selects examples of the greatest classifier uncertainty in terms of class membership probability. 
The idea is that these uncertain examples will be the most useful for tuning the 
classifier’s decision boundary. Example methods include Shannon entropy (SE), least 
confidence and maximum uncertainty. For a single unlabelled example $x$, least confidence is 
defined as $U_L(x, \theta(D)) = 1 - \hat{p}(\hat{y}|x)$, where $\hat{p}(\hat{y}|x)$ is the classifier’s estimated probability 
of the allocated class $\hat{y}$. Shannon entropy is defined as $U_E(x, \theta(D)) = -\sum_{j=1}^{k} \hat{p}_j \log(\hat{p}_j)$. The 
uncertainty sampling approach is popular and efficient, but lacks theoretical justification. 
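
The two uncertainty scores can be sketched in a few lines of Python. This is a minimal illustration, assuming the classifier's estimated class probabilities for each pool example are already available as lists; the function names are hypothetical.

```python
import math

def least_confidence(p_hat):
    """Least confidence: one minus the estimated probability of the allocated class."""
    return 1.0 - max(p_hat)

def shannon_entropy(p_hat):
    """Shannon entropy of the estimated class probability vector (natural log)."""
    return -sum(p * math.log(p) for p in p_hat if p > 0.0)

def select_most_uncertain(pool_probs, score):
    """Return the index of the pool example with the highest uncertainty score."""
    return max(range(len(pool_probs)), key=lambda i: score(pool_probs[i]))
```

For a binary problem both scores peak where the estimated class probabilities are equal, so both select the example closest to the estimated decision boundary.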

Version space search is a theoretical approach to AL, where the version space is the set of 
hypotheses (classifiers) that are consistent with the data (Mitchell, 1997; Dasgupta, 2011). 
Learning is then interpreted as a search through version space for the optimal hypothesis. 
The central idea is that AL can search this version space more efficiently than RS. 


Query by committee (QBC) is a heuristic approximation to version space search (Seung 
et al., 1992). Here a committee of classifiers is trained on the labelled data, which 
then selects the unlabelled examples where the committee’s predictions disagree the most. 
This prediction disagreement may focus on predicted classes (for example vote entropy) or 
predicted class probabilities (for example average Kullback-Leibler divergence); see Olsson 
(2009). These widely used versions of QBC are denoted QbcV and QbcA. A critical choice 
for QBC is the classifier committee, which lacks theoretical guidance. In this sense version 
space search leaves the optimal AL selection unspecified. 
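
As a hedged illustration of committee disagreement on predicted classes, the sketch below computes an unnormalised vote entropy from the committee's predicted labels for each pool example. The function names and the input layout (one list of votes per pool example) are assumptions made for this example; normalised variants of vote entropy also exist.

```python
import math
from collections import Counter

def vote_entropy(votes):
    """Entropy of the committee's vote distribution for one example."""
    n = len(votes)
    return -sum((c / n) * math.log(c / n) for c in Counter(votes).values())

def select_by_vote_entropy(pool_votes):
    """Index of the pool example on which the committee disagrees the most."""
    return max(range(len(pool_votes)), key=lambda i: vote_entropy(pool_votes[i]))
```

A unanimous committee gives zero entropy, while an evenly split committee gives the maximum, so the evenly split example is selected.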

Another approach to AL is exploitation of cluster structure in the pool. Elucidating the 
cluster structure of the pool could provide valuable insights for example selection. Dasgupta 
(2011) gives a motivating example: if the pool clusters neatly into $b$ class-pure clusters where 
$b = k$, then $b$ labels could suffice to build an optimal classifier. This very optimistic example 
does illustrate the potential gain. 

A third theoretical approach, notionally close to our contribution, is error reduction, 
introduced in Roy and McCallum (2001). This approach minimises the loss of the retrained 
classifier, which is the loss of the classifier which has been retrained on the selected example. 
Roy and McCallum consider two loss functions, error rate and log loss, to construct two 
quantities, which are referred to here as expected future error (EFE) and expected future log 
loss (EFLL). Those authors focus on methods to estimate EFE and EFLL, before examining 
the experimental performance of their estimators. 

Given a classifier fitting function $\theta$, labelled data $D$ and a single unlabelled example $x$, 
EFE is defined as 

$$EFE(x, \theta, D) = -E_{Y|x}[L_e(\theta(D \cup (x, Y)))] = -\sum_{j=1}^{k} \{p_j L_e(\theta(D \cup (x, c_j)))\},$$ 

where $L_e$ is error rate (see Section 2.1). EFLL is defined similarly to EFE, with log loss $L_o$ 
replacing error rate $L_e$. Both of these quantities average over the unobserved label $Y|x$. 

Roy and McCallum define an algorithm to calculate EFE, denoted EfeLc, which approximates 
the loss using the unlabelled pool for efficiency. Specifically it approximates error 
rate $L_e$ by the total least confidence over the entire pool: 

$$L_e(\theta(D)) \approx \sum_{x_i \in X_P} U_L(x_i, \theta(D)),$$ 

where $X_P$ are the unlabelled examples in the pool. The uncertainty function $U_L$ is intended 
to capture the class uncertainty of an unlabelled example. 

Roy and McCallum propose the following approximation for the value of EFE by calculating 

$$f_1(x, \theta, D) = -\sum_{j=1}^{k} \Big\{ \hat{p}_j \sum_{x_i \in X_P} U_L(x_i, \theta(D \cup (x, c_j))) \Big\} = -\sum_{j=1}^{k} \Big\{ \hat{p}_j \sum_{x_i \in X_P} (1 - \hat{p}(\hat{y}_i|x_i)) \Big\}. \qquad (1)$$ 

Here $\hat{p}_j$ is the current classifier’s estimate of the class probability for class $j$, while $\hat{y}_i$ is the 
predicted label for $x_i$ after a training update with the example $(x, c_j)$. Note that EfeLc uses 
the classifier’s posterior estimates after an update (to estimate the loss), whereas the 
uncertainty sampling approaches use the current classifier’s posterior estimates (to assess 
uncertainty). 
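
The retrain-and-score loop behind this approximation can be sketched as follows. This is only an illustration under stated assumptions, not Roy and McCallum's implementation: the toy classifier (`fit_means`, `predict_proba`) is a hypothetical univariate two-class model with unit-variance Gaussian classes and equal priors, and `efe_lc_score` is a made-up name for the pool-based approximation of EFE.

```python
import math

def fit_means(data):
    """Toy training function: per-class sample means of a 1-D covariate (classes 0, 1)."""
    return [sum(x for x, y in data if y == c) / sum(1 for _, y in data if y == c)
            for c in (0, 1)]

def predict_proba(means, x):
    """Toy prediction function: unit-variance Gaussian classes with equal priors."""
    w = [math.exp(-0.5 * (x - m) ** 2) for m in means]
    return [wi / sum(w) for wi in w]

def efe_lc_score(x, labelled, pool):
    """Negative expected total least confidence over the pool after retraining on x."""
    p_hat = predict_proba(fit_means(labelled), x)  # current estimates of p(c_j | x)
    score = 0.0
    for c in (0, 1):
        retrained = fit_means(labelled + [(x, c)])  # update with candidate label c
        total_lc = sum(1.0 - max(predict_proba(retrained, xi)) for xi in pool)
        score -= p_hat[c] * total_lc
    return score
```

The candidate with the largest score would be selected, for example `max(pool, key=lambda x: efe_lc_score(x, labelled, pool))`. Note that each candidate costs one retraining per class plus a pass over the whole pool.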

This approximation of $L_e$ by the total least confidence over the pool is potentially 
problematic. It is easy to construct cases (for example an extreme outlier) where a labelled 
example would reduce a classifier’s uncertainty, but also increase the overall error; such 
examples call into question the approximation of error by uncertainty. In the absence of 
further assumptions or motivation, it is hard to anticipate the statistical properties of $f_1$ in 
Equation 1 as an estimator. Further, EfeLc uses the same data to train the classifier and 
to estimate the class probabilities, thereby risking bias in the estimator (an issue explored 
further in a later section). 

The error reduction approach is similar in spirit to MRI, since the optimal example 
selection is first considered, and then specified in terms of classifier loss. In that sense, the 
quantity EFE is a valuable precursor to model retraining improvement, which is defined 
later in Equation 2. However, EFE omits the loss of the current classifier, which proves 
important when examining improvement (see Section 3.2). Further, EFE is only defined for 
individual AL, while MRI defines targets for both batch and individual AL. 

The estimation of a statistical quantity, consisting of multiple components, raises several 
statistical choices, in terms of component estimators and how to use the data. These choices 
are described and explored in a later section, whereas Roy and McCallum omit these choices, 
providing just a single algorithmic approach. In that sense, Roy and McCallum do not 
use EFE to construct an estimation framework for algorithms. Nor do Roy and McCallum 
use EFE to examine optimal AL behaviour, or compare it to the behaviour of known AL 
methods; Section 3.2 provides such an examination and comparison using MRI. Finally, the 
EFE algorithms do not show strong performance in the experimental results reported later. 
The current literature does not provide a statistical estimation framework for AL; MRI 
addresses this directly in the following sections. 

3. Model Retraining Improvement 

Here the statistical target, model retraining improvement, is defined and motivated as an 
estimation target, both theoretically and for applications. This further lays the groundwork 
for MRI as a statistical estimation framework for AL. This section defines the statistical 
target as an expectation, while a later section describes estimation problems, and algorithms for 
applications. 

3.1 The Definition of Model Retraining Improvement 

Table 1: Notation. 

Symbol           Description 
$(X, Y)$         Underlying distribution of the classification problem 
$p$              Bayes class probability vector, for covariate $x$: $p = p(y|x) = \{p(c_j|x)\}_{j=1}^{k}$ 
$\theta$         Classifier training function 
$\hat{\theta}$   Classifier estimated parameters, where $\hat{\theta} = \theta(D_T)$ 
$\phi$           Classifier prediction function; class probability vector $\hat{p} = \phi(\hat{\theta}, x)$ 
$D_S$            The labelled data: $D_S = (X_S, Y_S)$ 
$X_P$            The unlabelled pool 
$Q^c$            Statistical target, optimal for individual AL 
$B^c$            Statistical target, optimal for batch AL 
$L$              Classifier loss 
$L'_j$           Classifier future loss, after retraining on $(x, c_j)$: $L'_j = L(\theta(D_S \cup (x, c_j)))$ 
$L'$             Classifier future loss vector, for covariate $x$: $L' = \{L(\theta(D_S \cup (x, c_j)))\}_{j=1}^{k}$ 

The notation is summarised in Table 1. To define the statistical target, expectations 
are formed with respect to the underlying distribution $(X, Y)$. Assume a fixed dataset $D_S$ 
sampled i.i.d. from the joint distribution $(X, Y)$. The dependence of the classifier $\hat{\theta}$ on the 
data $D_S$ is critical, with the notation $\hat{\theta} = \theta(D_S)$ intended to emphasize this dependence. 

First assume a base classifier already trained on a dataset $D_S$. Consider how much a 
single labelled example improves performance. The single labelled example $(x, y)$ will be 
chosen from a labelled dataset $D_W$. The loss from retraining on that single labelled example 
is examined in order to later define the loss for the expected label of an unlabelled example. 

Examine the selection of a single labelled example $(x, y)$ from $D_W$, given the labelled 
data $D_S$, the classifier training function $\theta$ and a loss function $L$. The reduction of the loss 
for retraining on that example is defined as actual-MRI, denoted $Q^a$: 

$$Q^a(x, y, \theta, D_S) = L(\theta(D_S)) - L(\theta(D_S \cup (x, y))).$$ 

$Q^a$ is the actual classifier improvement from retraining on the labelled example $(x, y)$. 
The goal here is to maximise the reduction of loss. The greatest loss reduction is achieved 
by selecting the example $(x^*, y^*)$ from $D_W$ that maximises $Q^a$, given by 

$$(x^*, y^*) = \arg\max_{(x, y) \in D_W} Q^a(x, y, \theta, D_S).$$ 

Turning to AL, the single example $x$ is unlabelled, and will be chosen from the unlabelled 
pool $X_P$. Here the unknown label of $x$ is a random variable, $Y|x$, and taking its expectation 
allows the expected loss to be defined, this being the classifier loss after retraining with the 
unlabelled example and its unknown label. Thus the expected loss is defined using the 
expectation over the label $Y|x$ to form conditional-MRI, denoted $Q^c$: 

$$Q^c(x, \theta, D_S) = E_{Y|x}[Q^a(x, Y, \theta, D_S)] = L(\theta(D_S)) - E_{Y|x}[L(\theta(D_S \cup (x, Y)))]$$ 
$$= L(\theta(D_S)) - \sum_{j=1}^{k} \{p_j L(\theta(D_S \cup (x, c_j)))\} = L(\theta(D_S)) - \sum_{j=1}^{k} p_j L'_j = \underbrace{L(\theta(D_S))}_{\text{Term } T_c} - \underbrace{p \cdot L'}_{\text{Term } T_e}, \qquad (2)$$ 

where $p$ denotes the Bayes class probability vector $p(y|x)$. $L'_j$ denotes a single future 
loss, from retraining on $D_S$ together with one example $x$ given class $c_j$. $L'$ denotes the 
future loss vector, i.e. the vector of losses from retraining on $D_S$ together with one example, 
that example being $x$ combined with each possible label $c_j$: 
$L' = (L'_j)_{j=1}^{k} = \{L(\theta(D_S \cup (x, c_j)))\}_{j=1}^{k}$. 

Term $T_c$ is the loss of the current classifier, given the training data $D_S$. Term $T_e$ is the 
expected future loss of the classifier, after retraining on the enhanced dataset $(D_S \cup (x, Y|x))$. 

$Q^c$ is defined as the difference between Terms $T_c$ and $T_e$, i.e. the difference between the 
current loss and the expected future loss. Thus $Q^c$ defines the expected loss reduction, from 
retraining on the example $x$ with its unknown label. In this sense $Q^c$ is an improvement 
function, since it defines exactly how much this example will improve the classifier. 

The unlabelled example $x^*$ from the pool $X_P$ that maximises $Q^c$ is the optimal example 
selection: 

$$x^* = \arg\max_{x \in X_P} Q^c(x, \theta, D_S). \qquad (3)$$ 

Novel algorithms to estimate the target $Q^c$ are constructed in a later section. 
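
To make the structure of the target concrete, the following Python sketch evaluates the current loss minus the expected future loss for each pool example and picks the maximiser. It assumes an oracle setting, as in a simulation, where the true loss `loss` and the Bayes probabilities `bayes_probs` are available as callables; all names are illustrative, and in applications both quantities would have to be estimated.

```python
def conditional_mri(x, labelled, fit, loss, bayes_probs, classes):
    """Expected loss reduction from retraining on x with its unknown label."""
    current_loss = loss(fit(labelled))                           # Term T_c
    p = bayes_probs(x)                                           # Bayes vector p(y | x)
    future = [loss(fit(labelled + [(x, c)])) for c in classes]   # future loss vector L'
    expected_future = sum(pj * lj for pj, lj in zip(p, future))  # Term T_e = p . L'
    return current_loss - expected_future

def optimal_selection(pool, labelled, fit, loss, bayes_probs, classes):
    """Pool example that maximises the expected loss reduction."""
    return max(pool, key=lambda x: conditional_mri(
        x, labelled, fit, loss, bayes_probs, classes))
```

Each evaluation requires one retraining and one loss evaluation per class, which is the per-candidate cost for individual AL.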

For an abstract classification problem, the target $Q^c$ can be evaluated exactly, to reveal 
the best and worst possible loss reduction, by maximising and minimising $Q^c$. Figure 2 
shows that the best and worst AL performance curves are indeed obtained by maximising 
and minimising $Q^c$. 

Figure 2: The best and worst AL performance curves are obtained by maximising and minimising 
the target $Q^c$, which demonstrate the extremes of AL performance. With simulated 
data, $Q^c$ can be calculated exactly; here the classification problem is the Four-Gaussian 
problem (illustrated in Figure 6a). These curves are smoothed from multiple experiments, 
using the classifier 5-nn. 


The statistical quantity $Q^c$ defines optimal AL behaviour for any dataset $D_S$, whether 
iid or not, including the case of iterated AL, which generates a covariate bias in $D_S$ (see 
Section 2.2). Given $Q^c$ for the selection of a single example, i.e. for individual AL, the 
optimal behaviour is now extended to batch AL, the selection of multiple examples, via the 
target $B^c$, given below. 


3.1.1 Model Retraining Improvement for Batch Active Learning 

In batch AL, multiple examples are selected from the pool in one selection step. Each 
chosen batch consists of $r$ examples. Here MRI provides the statistical target $B^c$, the batch 
improvement function, defined as the expected classifier improvement over an unknown set 
of labels. 

First examine a fully labelled dataset $(x_R, y_R)$, where $R$ denotes the index set $\{1, \ldots, r\}$. 
For that fully labelled dataset, the actual loss reduction is denoted $B^a$: 

$$B^a(x_R, y_R, \theta, D_S) = L(\theta(D_S)) - L(\theta(D_S \cup (x_R, y_R))).$$ 

Second consider the AL context, with a set of unlabelled examples $x_R$, which is a single 
batch of examples selected from the pool. The expected loss reduction for this set of 
examples is denoted $B^c$: 

$$B^c(x_R, \theta, D_S) = E_{Y_R|x_R}[B^a(x_R, Y_R, \theta, D_S)] = L(\theta(D_S)) - E_{Y_R|x_R}[L(\theta(D_S \cup (x_R, Y_R)))]$$ 
$$= L(\theta(D_S)) - \sum_{j_1=1}^{k} \sum_{j_2=1}^{k} \cdots \sum_{j_r=1}^{k} \{p_{j_1} p_{j_2} \cdots p_{j_r} \times L(\theta(D_S \cup (x_1, c_{j_1}) \cup (x_2, c_{j_2}) \cup \cdots \cup (x_r, c_{j_r})))\}.$$ 

This expected loss reduction $B^c$ is an expectation taken over the unknown set of labels 
$(Y_R|x_R)$. $B^c$ is the statistical target for batch AL, and the direct analog of $Q^c$ defined in 
Equation 2. 

Estimating $B^c$ incurs two major computational costs, in comparison to $Q^c$ estimation. 
First there is the huge increase in the number of selection candidates. For individual AL, 
each selection candidate is a single example, and there are only $n_p$ candidates to consider 
(where $n_p$ is the pool size). Under batch AL, each selection candidate is a set of examples, 
each set having size $r$; the number of candidates jumps to $\binom{n_p}{r}$. Thus batch AL generates 
a drastic increase in the number of selection candidates, from $n_p$ to $\binom{n_p}{r}$, which presents a 
major computational cost. 

The second cost of $B^c$ estimation lies in the number of calculations per selection candidate. 
In individual AL, each candidate requires one classifier retraining and one loss 
evaluation per class, for all $k$ classes. However in batch AL, each candidate requires multiple 
classifier retraining and loss evaluations, each candidate now requiring $k^r$ calculations. 
Hence the number of calculations increases greatly, from $k$ to $k^r$, which is a severe computational 
cost. 
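
A quick calculation shows how fast these two costs grow. The helper below is a hypothetical illustration that counts the candidate batches (pool size choose batch size) and the retrainings per candidate (one per joint labelling of the batch).

```python
import math

def batch_cost(n_pool, r, k):
    """Return (number of candidate batches, retrainings per candidate)."""
    candidates = math.comb(n_pool, r)   # choose r examples from the pool
    retrainings_per_candidate = k ** r  # one retraining per joint label assignment
    return candidates, retrainings_per_candidate
```

For instance, a pool of 1000 examples with three classes gives 1000 candidates and 3 retrainings each for individual AL (`r = 1`), but on the order of $8 \times 10^{12}$ candidate batches with 243 retrainings each for `r = 5`.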

These major computational costs make direct estimation of the target $B^c$ extremely 
challenging. Thus for batch AL the more practical option is to recommend algorithms that 
estimate $Q^c$, such as the algorithms given in a later section. 

$Q^c$ and $B^c$ together define the optimal AL behaviour for individual AL and batch AL. 
These targets provide optimal AL behaviour for both single-step and iterated AL. The rest 
of this work focusses on the target $Q^c$ as the foundation of MRI’s estimation framework for 
AL. 

3.2 Abstract Example 

An example using an abstract classification problem is presented, to illustrate MRI in detail. 
The stochastic character of this problem is fully specified, allowing exact calculations of the 
loss $L$, and the statistical target $Q^c$ as functions of the univariate covariate $x$. To reason 
about $Q^c$ as a function of $x$, an infinite pool is assumed, allowing any $x \in \mathbb{R}$ to be selected. 
These targets are then explored as functions of $x$, and the optimal AL selection $x^*$ is 
examined (see Equation 3). 

The full stochastic description allows examination of the AL method’s selection, denoted 
$x_r$, and comparison to the optimal selection $x^*$. This comparison is made below for the 
popular AL heuristic Shannon entropy, and for random selection. 

Imagine a binary univariate problem, defined by a balanced mixture of two Gaussians: 
$\{\pi = (\tfrac{1}{2}, \tfrac{1}{2}), (X|Y = c_1) \sim N(-1, 1), (X|Y = c_2) \sim N(1, 1)\}$. The true means are denoted 
$\mu_1 = -1, \mu_2 = 1$. The loss function is error rate $L_e$ (defined in Section 2.1), while the true 
decision boundary to minimise error rate is denoted $t = \tfrac{1}{2}(\mu_1 + \mu_2)$. 

Every dataset $D$ of size $n$ sampled from this problem is assumed to split equally into 
two class-pure subsets $D_j = \{(x_i, y_i) \in D : y_i = c_j\}$, each of size $n_j = n/2$; this is sampling 
while holding the prior fixed. 

Consider a classifier that estimates only the class-conditional means, given the true 
prior $\pi$ and the true common variance of 1. The classifier parameter vector is $\hat{\theta} = (\hat{\mu}_1, \hat{\mu}_2)$, 
where $\hat{\mu}_j$ is the sample mean for class $c_j$. This implies that the classifier’s estimated decision 
boundary to minimise error rate is $\hat{t} = \tfrac{1}{2}(\hat{\mu}_1 + \hat{\mu}_2)$. 


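3.2.1 Calculation and Exploration of $Q^c$ 

Here $Q^c$ is calculated, then explored as a function of $x$. The classifier’s decision rule $r_1(x)$ 
minimises the loss $L_e(\hat{\theta})$, and is given in terms of a threshold on the estimated class 
probabilities by 

$$r_1(x) = \begin{cases} \hat{y} = c_1 : \hat{p}_1(x) > \tfrac{1}{2}, \\ \hat{y} = c_2 : \hat{p}_1(x) < \tfrac{1}{2}, \end{cases}$$ 

or equivalently, in terms of a decision boundary on $x$, by 

$$r_2(x) = \begin{cases} \hat{\mu}_1 < \hat{\mu}_2 : \hat{y} = c_1 \text{ if } x < \hat{t}, \ c_2 \text{ otherwise}, \\ \hat{\mu}_1 > \hat{\mu}_2 : \hat{y} = c_1 \text{ if } x > \hat{t}, \ c_2 \text{ otherwise}. \end{cases}$$ 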


The classifier may get the estimated class means the wrong way around, in the unlikely 
case that $\hat{\mu}_1 > \hat{\mu}_2$. As a result the classifier’s behaviour is very sensitive to the condition 
$(\hat{\mu}_1 > \hat{\mu}_2)$, as shown by the second form of the decision rule $r_2(x)$, and by the loss function 
in Equation 4. 

It is straightforward to show that the loss $L_e(\hat{\theta})$ is given by 

$$L_e(\hat{\theta}) = \tfrac{1}{2}\{1 - F_1(\hat{t}) + F_2(\hat{t}) + 1(\hat{\mu}_1 > \hat{\mu}_2)[2F_1(\hat{t}) - 2F_2(\hat{t})]\}, \qquad (4)$$ 

where $F_j(x)$ denotes the cdf for class-conditional distribution $(X|Y = c_j)$. 
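
The error rate of this two-Gaussian example is easy to evaluate numerically. The Python sketch below is an illustration only: `norm_cdf` and `error_rate` are hypothetical helper names, the class-conditional cdfs are unit-variance normal cdfs written with `math.erf`, and the true means default to the values of the example.

```python
import math

def norm_cdf(x, mean):
    """Cdf of a unit-variance normal distribution with the given mean."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0)))

def error_rate(mu1_hat, mu2_hat, mu1=-1.0, mu2=1.0):
    """Expected error rate of the mean-based classifier with estimated means."""
    t_hat = 0.5 * (mu1_hat + mu2_hat)               # estimated decision boundary
    f1, f2 = norm_cdf(t_hat, mu1), norm_cdf(t_hat, mu2)
    flipped = 1.0 if mu1_hat > mu2_hat else 0.0     # indicator that the means are reversed
    return 0.5 * (1.0 - f1 + f2 + flipped * (2.0 * f1 - 2.0 * f2))
```

With the estimated means equal to the true means the function returns the Bayes error of the problem, and swapping the two estimated means returns one minus that value, which illustrates the sensitivity to the reversal condition.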

In individual AL an unlabelled point x is chosen for the oracle to label, before retraining 
the classifier. Retraining the classifier with a single new example (x,Cj) yields a new pa¬ 
rameter estimate denoted 6 ^, where the mean estimate for class Cj has a new value denoted 
fi'p with a new estimated boundary denoted t'y 

Here fi'j = (1 — z)fij + zx where z = z being an updating constant which reflects 
the impact of the new example on the mean estimate fij. 
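To calculate $Q^c$ under error loss $L_e$, observe that the Term $T_e$ from Equation 2 is
$[p_1 L_e(\hat{\theta}'_1) + p_2 L_e(\hat{\theta}'_2)]$. Term $T_c$ in Equation 2 is directly given by Equation 4. From
Equations 2 and 4, $Q^c(x, \theta, D) = L_e(\hat{\theta}) - [p_1 L_e(\hat{\theta}'_1) + p_2 L_e(\hat{\theta}'_2)]$, hence

$$
\begin{aligned}
Q^c(x, \theta, D) ={} & \tfrac{1}{2}\{1 - F_1(\hat{t}) + F_2(\hat{t}) + \mathbb{1}(\hat{\mu}_1 > \hat{\mu}_2)[2F_1(\hat{t}) - 2F_2(\hat{t})]\} \\
& - \tfrac{p_1}{2}\{1 - F_1(\hat{t}'_1) + F_2(\hat{t}'_1) + \mathbb{1}(\hat{\mu}'_1 > \hat{\mu}_2)[2F_1(\hat{t}'_1) - 2F_2(\hat{t}'_1)]\} \\
& - \tfrac{p_2}{2}\{1 - F_1(\hat{t}'_2) + F_2(\hat{t}'_2) + \mathbb{1}(\hat{\mu}_1 > \hat{\mu}'_2)[2F_1(\hat{t}'_2) - 2F_2(\hat{t}'_2)]\},
\end{aligned}
$$

where $p_j$, $\hat{\mu}'_j$, and $\hat{t}'_j$ are functions of $x$.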
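
As a numerical illustration of the expression above, the following Python sketch evaluates the expected loss reduction for this toy problem. It is only a sketch under stated assumptions: the true means are taken as $(-1, 1)$ (as implied by the panel titles of Figure 3), the prior is balanced, each class has $n_j = 9$ labelled examples (an assumption, half of the 18 labelled examples quoted in the figure captions), the updating constant is $z = 1/(n_j + 1)$, and $p_j$ is the true posterior $p(c_j \mid x)$. All function names are illustrative.

```python
import math

TRUE_MEANS = (-1.0, 1.0)  # assumed true class means; unit variance, balanced prior


def norm_cdf(x, mean):
    """Cdf of a unit-variance Gaussian with the given mean."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0)))


def error_loss(m1, m2):
    """Error rate of the classifier with estimated means (m1, m2), as in Equation 4."""
    t_hat = 0.5 * (m1 + m2)
    f1, f2 = norm_cdf(t_hat, TRUE_MEANS[0]), norm_cdf(t_hat, TRUE_MEANS[1])
    swapped = 1.0 if m1 > m2 else 0.0  # indicator that the estimated means are reversed
    return 0.5 * (1.0 - f1 + f2 + swapped * (2.0 * f1 - 2.0 * f2))


def q_c(x, m1, m2, n_j=9):
    """Expected loss reduction from retraining on x with its unknown label."""
    z = 1.0 / (n_j + 1)  # weight of the new example in the updated sample mean
    d1 = math.exp(-0.5 * (x - TRUE_MEANS[0]) ** 2)
    d2 = math.exp(-0.5 * (x - TRUE_MEANS[1]) ** 2)
    p1 = d1 / (d1 + d2)  # true posterior p(c1 | x) under a balanced prior
    loss_if_c1 = error_loss((1.0 - z) * m1 + z * x, m2)
    loss_if_c2 = error_loss(m1, (1.0 - z) * m2 + z * x)
    return error_loss(m1, m2) - (p1 * loss_if_c1 + (1.0 - p1) * loss_if_c2)


# Estimated means right-shifted by 0.5: scan a grid of candidate x for the best one.
grid = [i / 10.0 for i in range(-40, 41)]
x_star = max(grid, key=lambda x: q_c(x, -0.5, 1.5))
```

On this grid the maximiser for the right-shifted case is negative, and for estimated means $(-1.1, 1.1)$ no candidate gives a positive value, which agrees with the qualitative description of the cases in the text.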

Even for this simple univariate problem, $Q^c(x, \theta, D)$ is a complicated non-linear function
of $x$. Given this complication, $Q^c$ is explored by examining specific cases of the estimated
parameter $\hat{\theta}$, shown in Figure 3. In each specific case of $\hat{\theta}$, $x^*$ yields greatest correction
to $\hat{\theta}$ in terms of moving the estimated boundary $\hat{t}$ closer to the true boundary $t$. This is
intuitively reasonable since error rate is a function of $\hat{t}$ and minimised for $\hat{t} = t$.

In the first two cases (Figures 3a and 3b), the estimated threshold is greater than the
true threshold, $\hat{t} > t$. In these two cases, $x^*$ is negative, hence retraining on $x^*$ will reduce
the estimated threshold $\hat{t}$, bringing it closer to the true threshold $t$, thereby improving the
classifier. In the third case (Figure 3c), $\hat{t} = t$ and here the classifier's loss $L_e$ cannot be
reduced, shown by $Q^c(x) < 0$ for all $x$. The fourth case (Figure 3d) is interesting because the
signs of the estimated means are reversed compared to the true means, and here the most
non-central $x$ offer greatest classifier improvement. Together these cases show that even for
this toy example, the improvement function $Q^c$ is complicated and highly dependent on the
estimated parameters.


3.2.2 Exploration of Shannon Entropy and Random Selection 


The abstract example is used to compare two selection methods, SE and RS, against
optimal AL behaviour.

SE always selects $x_r$ at the estimated boundary $\hat{t}$. RS selects uniformly from the pool,
assumed to be i.i.d. in AL, hence the RS selection probability is given by the marginal
density $p(x)$. In contrast to $Q^c$ and SE, RS is a stochastic selection method, with expected
selection $x_r = 0$ in this problem. Figure 4 illustrates $Q^c$, SE and $p(x)$ as contrasting
functions of $x$, with very different maxima.

$Q^c$ is asymmetric in the first two cases, and symmetric for the final two. By contrast,
SE and RS are always symmetric (for all possible values of $\hat{\theta}$).

In the first two cases (Figures 4a and 4b), SE selects a central $x_r$, thereby missing the
optimal selection $x^*$. In the second case (Figure 4b), SE selects $x_r$ with $Q^c(x_r) < 0$, failing
to improve the classifier, whereas the optimal selection $x^*$ does improve the classifier since
$Q^c(x^*) > 0$. The third case is unusual, since $\hat{t} = t$ and this classifier's loss cannot be
improved, hence $Q^c(x) < 0$ for all $x$. In the fourth case (Figure 4d) SE makes the worst
possible choice of $x$. In all four cases, SE never chooses the optimal point; SE may improve
the classifier, but never yields the greatest improvement. These specific cases of $\hat{\theta}$ show
that SE often makes a suboptimal choice for $x_r$, for this abstract example.
(a) $\hat{\theta} = (\hat{\mu}_1 = -0.5, \hat{\mu}_2 = 1.5)$; $\hat{\mu}_1$, $\hat{\mu}_2$ are right-shifted, $\hat{\mu}_j = \mu_j + 0.5$
(b) $\hat{\theta} = (\hat{\mu}_1 = -0.9, \hat{\mu}_2 = 1.1)$; $\hat{\mu}_1$, $\hat{\mu}_2$ are right-shifted, $\hat{\mu}_j = \mu_j + 0.1$
(c) $\hat{\theta} = (\hat{\mu}_1 = -1.1, \hat{\mu}_2 = 1.1)$; $\hat{\mu}_1$, $\hat{\mu}_2$ are wider, $|\hat{\mu}_j| = |\mu_j| + 0.1$
(d) $\hat{\theta} = (\hat{\mu}_1 = 1, \hat{\mu}_2 = -1)$; $\hat{\mu}_1$, $\hat{\mu}_2$ have inverse signs, $\hat{\mu}_j = -\mu_j$

Figure 3: Illustration of the target $Q^c$ as a function of $x$, for specific cases of the estimated
classifier parameters $\hat{\theta} = (\hat{\mu}_1, \hat{\mu}_2)$. The class mean parameters are shown in solid blue and
red, with the estimated means shown in dotted blue and red. The green line indicates
$Q^c(x) = 0$ (zero improvement); in all cases, $n_s = 18$. In each specific case, the optimal
selection $x^*$ yields greatest correction to $\hat{\theta}$ in terms of moving the estimated boundary $\hat{t}$
closer to the true boundary $t$.
(a) $\hat{\theta} = (\hat{\mu}_1 = -0.5, \hat{\mu}_2 = 1.5)$; $\hat{\mu}_1$, $\hat{\mu}_2$ are right-shifted, $\hat{\mu}_j = \mu_j + 0.5$
(b) $\hat{\theta} = (\hat{\mu}_1 = -0.9, \hat{\mu}_2 = 1.1)$; $\hat{\mu}_1$, $\hat{\mu}_2$ are right-shifted, $\hat{\mu}_j = \mu_j + 0.1$
(c) $\hat{\theta} = (\hat{\mu}_1 = -1.1, \hat{\mu}_2 = 1.1)$; $\hat{\mu}_1$, $\hat{\mu}_2$ are wider, $|\hat{\mu}_j| = |\mu_j| + 0.1$
(d) $\hat{\theta} = (\hat{\mu}_1 = 1, \hat{\mu}_2 = -1)$; $\hat{\mu}_1$, $\hat{\mu}_2$ have inverse signs, $\hat{\mu}_j = -\mu_j$

Figure 4: Comparison of $Q^c$ against SE and RS as functions of $x$, for specific cases of the
estimated classifier parameters $\hat{\theta} = (\hat{\mu}_1, \hat{\mu}_2)$. $Q^c$ is shown in black, SE in purple and RS
in orange (for RS, the density $p(x)$ is shown). The class mean parameters are shown in
solid blue and red, with the estimated means shown in dotted blue and red. The green
line indicates $Q^c(x) = 0$ (zero improvement); in all cases, $n_s = 18$. The three functions are
scaled to permit this comparison.
Turning to consider RS, the stochastic nature of RS suggests that the expected RS
selection is the quantity of interest. For these four cases of $\hat{\theta}$, the expected RS selection is
a suboptimal choice of $x$ for this problem. It is notable that the expected RS selection is
usually close to the SE selection. The stochastic nature of RS implies that it often selects
far more non-central $x$ values than SE.


3.3 Unbiased $Q^c$ Estimation Outperforms Random Selection


We present an argument that an unbiased estimator of $Q^c$ will always exceed RS in AL
performance. This formal approach opens the door to a new guarantee for AL, which motivates
the estimation framework that MRI provides. This guarantee is not tautological since
RS generally improves the classifier, making RS a reasonable benchmark to outperform. By
contrast, heuristic AL methods such as SE lack any estimation target, making arguments of
this kind difficult to construct. This argument also motivates the algorithm bootstrapMRI
(see Section 4.2).

The context is an AL scenario with a specific classification problem, classifier and loss
function. We examine the selection of a single example from a pool $X_P$ consisting of just
two examples, $X_P = \{x_1, x_2\}$.

From Equation 2, the target function $Q^c$ depends on both the labelled data $D_S$ and the
population distribution $(X, Y)$. This dependency on both data and population is somewhat
unusual for an estimation target, but other statistical targets share this property, for example
classifier loss. Here the labelled dataset $D_S$ is considered a random variable, hence the
values of $Q^c$ over the pool are also random. Consider a hypothetical $Q^c$ estimator, unbiased
in this sense: $(\forall x_i \in \mathbb{R}^d)\; E[\hat{Q}^c(x_i, \theta, D_S)] = [Q^c(x_i, \theta, D_S)]$.

For a single example $x_i$, the true and estimated values of $Q^c$ are denoted by $Q_i =
Q^c(x_i, \theta, D_S)$ and $\hat{Q}_i = \hat{Q}^c(x_i, \theta, D_S)$ respectively. Since the estimator is unbiased, the
relationship between these quantities can be conceptualised as $\hat{Q}_i = Q_i + M_i$, where $M_i$ is
defined as a noise term with zero mean and variance $\sigma^2$, with $E_{D_S}[M_i] = 0$. We assume
that $M_i \perp\!\!\!\perp Q_i$, and make the moderate assumption that $M_i \sim N(0, \sigma^2)$.

The difference between the true $Q^c$ values is defined as $R = Q_1 - Q_2$. We begin by
addressing the case where $(R > 0)$, i.e. $(Q_1 > Q_2)$. The probability that the optimal
example is chosen, denoted $\lambda$, will illustrate the estimator's behaviour under different noise
variances, $\sigma^2$.

We now quantify the selection probability $\lambda$ explicitly in terms of estimator variance.
This selection probability $\lambda$ is given by

$$
\begin{aligned}
\lambda &= p(\hat{Q}_1 > \hat{Q}_2) \\
&= p(Q_1 + M_1 > Q_2 + M_2) \\
&= p(Q_1 - Q_2 > M_2 - M_1) \\
&= p(M_2 - M_1 < Q_1 - Q_2),
\end{aligned}
$$

which can be rewritten as $\lambda = p(N < \Delta)$ where $N = M_2 - M_1$ is defined as a mean zero
RV, and $\Delta = Q_1 - Q_2$ is strictly positive (since $R > 0$). $N$ is Gaussian, since $M_1$ and $M_2$
are both Gaussian. This variable $\Delta$ provides a ranking signal for example selection: its sign
shows that $x_1$ is a better choice than $x_2$, and its magnitude shows how much better.
Further defining $\alpha = p(N < 0)$ and $\beta = p(0 < N < \Delta)$ and combining with $p(N <
\Delta) = p(N < 0) + p(0 < N < \Delta)$ gives $\lambda = \alpha + \beta$. Here $\alpha \perp\!\!\!\perp \Delta$ whereas $\beta \not\perp\!\!\!\perp \Delta$, showing
that $\alpha$ is a pure noise term devoid of any $Q^c$ ranking information, while $\beta$ contains ranking
information by its dependency on $\Delta$.

We now establish that $\alpha = \frac{1}{2}$ by examining the special case of estimator variance
tending towards infinity. This value of $\alpha = \frac{1}{2}$ proves important in relating the selection
behaviour of the infinite-variance estimator to random selection.

As $\sigma^2 \to \infty$, $\beta \to 0$, this result being shown in Appendix A. Hence as $\sigma^2 \to \infty$, $\lambda \to \alpha$.
Thus as $\sigma^2 \to \infty$, $\lambda \perp\!\!\!\perp (Q_1, Q_2)$ since $\alpha \perp\!\!\!\perp (Q_1, Q_2)$, hence $\lambda$ becomes independent of true
$Q^c$ values, depending only on noise. Hence the limiting case, as the estimator variance
approaches infinity, corresponds to uniform selection over the pool.

A closely related argument for $\alpha = \frac{1}{2}$ is the impossibility of selection by signal-free noise
$\alpha$ outperforming RS. Again considering $\sigma^2 \to \infty$, if $\alpha > \frac{1}{2}$ then $\lambda > \frac{1}{2}$, which will consistently
prefer the better example $x_1$, and therefore consistently outperform RS. Whereas $\alpha < \frac{1}{2}$
gives $\lambda < \frac{1}{2}$, which will consistently prefer the worse example $x_2$, and therefore consistently
underperform RS. However, outperforming RS when selecting examples by noise alone is
impossible, which implies $\alpha = \frac{1}{2}$. Further, $N$ is Gaussian with mean zero, which directly
gives $\alpha = \frac{1}{2}$.

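From $\alpha = \frac{1}{2}$, $\lambda$ can be expressed purely in terms of $\beta$ as

$$
\lambda = \tfrac{1}{2} + \beta. \quad (5)
$$

As $\sigma^2 \to \infty$, $\beta \to 0$ hence $\lambda \to \frac{1}{2}$. When $\sigma^2 \to 0$, $\beta \to \frac{1}{2}$ as shown in Appendix B. Hence as
$\sigma^2 \to 0$, $\lambda \to 1$. Since $N$ is Gaussian, $\beta \in (0, \frac{1}{2}]$, hence $\lambda \in (\frac{1}{2}, 1]$.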
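
The limiting behaviour in Equation 5 can be checked numerically. Write $\lambda$ for the probability of selecting the better example and $\Delta = Q_1 - Q_2 > 0$ for the true difference. The Python sketch below makes one assumption beyond the text: the two noise terms $M_1$ and $M_2$ are independent with common variance $\sigma^2$, so that $N = M_2 - M_1 \sim N(0, 2\sigma^2)$ and $\lambda = \Phi(\Delta / (\sigma\sqrt{2}))$. The function name and the value $\Delta = 0.1$ are illustrative only.

```python
import math


def selection_probability(delta, sigma):
    """P(N < delta) for N ~ Normal(0, 2 * sigma**2), with delta = Q1 - Q2 > 0."""
    # Phi(z) = 0.5 * (1 + erf(z / sqrt(2))) with z = delta / (sigma * sqrt(2))
    return 0.5 * (1.0 + math.erf(delta / (2.0 * sigma)))


delta = 0.1  # illustrative true difference in quality between the two pool examples
for sigma in (0.01, 0.1, 1.0, 100.0):
    lam = selection_probability(delta, sigma)
    # beta = lam - 0.5 is the part of the selection probability that carries signal
    print(f"sigma={sigma:>6}: lambda={lam:.4f}, beta={lam - 0.5:.4f}")
```

Under these assumptions the selection probability stays above one half for every finite noise level, approaches one as the noise vanishes, and approaches one half (uniform selection) as the noise grows.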

Having examined the case where $(R > 0)$, we now consider all of the possibilities for
$R$. The zero probability case $(R = 0)$ is discarded, leaving only the second case defined by
$(R < 0)$.

In this second case $(R < 0)$, i.e. $(Q_1 < Q_2)$, the optimal selection is $x_2$, with

$$
\begin{aligned}
\lambda &= p(\hat{Q}_2 > \hat{Q}_1) \\
&= p(Q_2 + M_2 > Q_1 + M_1) \\
&= p(M_2 - M_1 > Q_1 - Q_2),
\end{aligned}
$$

rewritten as $\lambda = p(N > -\Delta_2)$ where $\Delta_2 = Q_2 - Q_1$ is strictly positive (since $R < 0$). Hence

$$
\begin{aligned}
\lambda &= p(N > -\Delta_2) \\
&= p(N > 0) + p(-\Delta_2 < N < 0) \\
&= (1 - \alpha) + \beta_2 \\
&= \tfrac{1}{2} + \beta_2,
\end{aligned}
$$

where $\beta_2 = p(-\Delta_2 < N < 0)$. Since $N$ is Gaussian, it is symmetric, giving $\beta_2 = p(-\Delta_2 <
N < 0) = p(0 < N < \Delta_2)$.

Here $\Delta$ and $\Delta_2$ differ only in magnitude, and their magnitudes do not feature in the
proofs in Appendices F and G. As a result, $\beta_2$ takes the very same values as $\beta$ when $\sigma \to 0$ or
$\sigma \to \infty$, namely $\{\frac{1}{2}, 0\}$ (see the proofs in Appendices F and G). Thus the selection behaviour
of the unbiased estimator is the same for both cases of $(R > 0)$ and $(R < 0)$, both cases
being described by Equation 5.

The RHS of Equation 5 quantifies the combination of signal and noise in AL selection,
with the estimator variance $\sigma^2$ determining $\beta$ and $\lambda$. Now the AL performance under the
$Q^c$ estimator can be elucidated in terms of the estimator variance.

The extreme case of infinite variance where $\lambda = \frac{1}{2}$ implies that the selection of examples
is entirely random, and here the estimator's behaviour is identical to random selection (RS),
an established AL benchmark. By contrast, if $\lambda$ exceeds $\frac{1}{2}$, examples with better $Q^c$ values
are more likely to be selected, leading to better expected AL performance than RS.

This argument applies directly to a pool of two elements. The ranking of a larger pool
can be decomposed into pairwise comparisons, which may extend this argument to any
pool. This argument serves to illustrate that an unbiased $Q^c$ estimator outperforms RS,
which is a new guarantee for AL. This argument receives experimental support from the
results described in Section 5.5.

We make no attempt to prove the existence of such an unbiased $Q^c$ estimator. The
bootstrapMRI algorithm given in Section 4.2 is constructed, as far as is practical, to capture
the key characteristics of an ideal unbiased estimator.


4. Algorithms to Estimate Model Retraining Improvement 

For practical estimation of $Q^c$, Term $T_c$ in Equation 2 can be ignored since it is independent
of $x$. Thus the central task of practical $Q^c$ estimation is the calculation of Term $T_e$ in
Equation 2, this Term $T_e$ being the expected classifier loss after retraining on the new
example $x$ with its unknown label $y|x$. The definition of Term $T_e$ in Equation 2 includes
two components: $\mathbf{p}$ and $\mathbf{L}'$. Consequently, $Q^c$ estimation requires estimating these two
components from one labelled dataset $D_S$.

Estimating multiple quantities from a single dataset raises interesting statistical choices.
One major choice must be made between using the same data to estimate both components
(termed naive reuse), or using bootstrapping to generate independent resampled datasets,
producing independent component estimates. This choice between naive reuse and bootstrapping
has implications for the bias of $Q^c$ estimates, discussed below.

Here we assume that loss estimation itself requires two datasets, for training and testing,
denoted $D_T$ and $D_E$ respectively. Since $\mathbf{p}$ estimation requires one dataset, three
datasets are needed in total, denoted $D_P$, $D_T$ and $D_E$, to estimate the two components $\mathbf{p}$
and $\mathbf{L}'$:

• The class probability vector, $\mathbf{p} = p(Y|x)$, estimated by $\hat{\mathbf{p}}$ using dataset $D_P$,

• The future loss vector, $\mathbf{L}'$, estimated by $\hat{\mathbf{L}}'$ using datasets $D_T$ and $D_E$.

Each of these three datasets ($D_P$, $D_T$ and $D_E$) must be derived from $D_S$.

In the case of naive reuse, all three datasets equal Ds, yielding the algorithm simpleMRI 
described below. For bootstrapping, the three datasets are all resampled from Ds with 
replacement, giving the algorithm bootstrapMRI described below. These two algorithms 
are extreme cases, chosen for clarity and performance; numerous variations are possible 
here. 
A statistical estimate is considered precise when it has low estimation error. Literature
on empirical learning curves suggests that classifier loss $L$ is larger for smaller training data
samples (Perlich et al., 2003; Gu et al., 2001; Kadie, 1995). This implies that $\mathbf{p}$ is difficult
to estimate precisely, since precise estimates of $\mathbf{p}$ would directly produce a near-optimal
classifier (one close to the optimum Bayes classifier, in terms of loss). The increased loss for
smaller samples further implies that loss $L$ itself is hard to estimate precisely for a small
training dataset; for if loss could be precisely estimated, a near-optimal classifier could be
found by direct optimisation.

This line of reasoning suggests that the two main components of $Q^c$, $\mathbf{p}$ and $\mathbf{L}'$, are
both very difficult to estimate precisely from small data samples. In practical applications
where all quantities must be estimated from data, the $Q^c$ estimates will inevitably suffer from
imprecision.


4.1 Algorithm SimpleMRI 


We present the simpleMRI algorithm to estimate $Q^c$, to illustrate the statistical framework.
The pseudocode for simpleMRI is provided in Algorithm 1. This first algorithm takes a
simple approach where all of $D_S$ is used to estimate all three components. The algorithm
uses the maximum amount of data for each component estimate, broadly intending to reduce
the variance of these component estimates.

The class probability vector $\mathbf{p}$ is estimated by training a second classifier $\theta_2$ on $D_P$, then
using its predicted probability vector $\hat{\mathbf{p}}$ for the example $x$. This second classifier is 5-nn, or
random forest when the base classifier is $k$-nn (Breiman 2001). For the future loss vector
$\mathbf{L}'$, each element $L'_j$ is estimated by training the base classifier $\theta$ on $D_T \cup (x, c_j)$, then
computing a loss estimate using $D_E$.

The simpleMRI algorithm immediately encounters a problem in estimating Term $T_e$:
the same data $D_S$ is used both to train the classifier and also to estimate the loss. This
in-sample loss estimation is known to produce optimistic, biased estimates of the loss (Hastie
et al. 2009, Chapter 7). The simpleMRI algorithm suffers another potential problem with
bias: the same data $D_S$ is used to estimate the class probability and estimate the loss,
leading to dependence between the estimates of $\mathbf{p}$ and $\mathbf{L}'$. This dependence of component
estimates may produce bias in the $Q^c$ estimate from simpleMRI, since the argument of
Equation 6 for unbiased estimation requires independent component estimates.

These two problems of biased and dependent component estimates under naive reuse
motivate the development of a second algorithm, termed bootstrapMRI, described below.

For computational efficiency, $Q^c$ values are only evaluated on a randomly (uniformly)
chosen subset of the pool. This popular AL optimisation is commonly termed random
sub-sampling.
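
A minimal Python sketch of the naive-reuse scheme follows. It is not the implementation used in the study: for brevity, a one-dimensional nearest-class-mean model stands in for both the base classifier and the second (probability) classifier, the loss is the error rate, and every function name is hypothetical. The function returns the estimated expected loss after retraining (Term $T_e$); since the current-loss term does not depend on $x$, a lower value corresponds to a more attractive candidate.

```python
import math


def fit_means(data):
    """Train a nearest-class-mean model: per-class sample means of (x, y) pairs."""
    sums, counts = {}, {}
    for x, y in data:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {c: sums[c] / counts[c] for c in sums}


def predict(means, x):
    """Predict the class whose estimated mean is closest to x."""
    return min(means, key=lambda c: abs(x - means[c]))


def predict_proba(means, x, classes):
    """Class probabilities at x, assuming unit-variance Gaussians and equal priors."""
    w = [math.exp(-0.5 * (x - means[c]) ** 2) if c in means else 0.0 for c in classes]
    total = sum(w)
    return [v / total for v in w] if total > 0 else [1.0 / len(classes)] * len(classes)


def error_rate(means, data):
    """Fraction of (x, y) pairs in data that the model misclassifies."""
    return sum(predict(means, x) != y for x, y in data) / len(data)


def simple_mri(x, d_s, classes):
    """Estimate the expected post-retraining loss for candidate x, reusing d_s throughout."""
    d_p = d_t = d_e = d_s  # naive reuse: the same data plays all three roles
    p_hat = predict_proba(fit_means(d_p), x, classes)
    loss_hat = [error_rate(fit_means(d_t + [(x, c)]), d_e) for c in classes]
    return sum(p * l for p, l in zip(p_hat, loss_hat))


# Illustrative data only: a tiny labelled set and a small pool of candidates.
labelled = [(-1.4, 0), (-0.6, 0), (0.3, 0), (0.8, 1), (1.9, 1)]
pool = [-2.0, 0.0, 0.5, 2.0]
scores = {x: simple_mri(x, labelled, [0, 1]) for x in pool}
best = min(scores, key=scores.get)  # candidate with the lowest estimated future loss
```

Because the loss is measured on the same data used for training, the values in `scores` inherit the optimistic in-sample bias discussed above.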


4.2 Algorithm BootstrapMRI 

BootstrapMRI seeks to minimise estimator bias in two ways: by generating independent
component estimates, and by providing component estimators of reasonably low bias. If the
two component estimators $\hat{\mathbf{p}}$ and $\hat{\mathbf{L}}'$ are independent, and both unbiased, then the $Q^c$
estimator will be unbiased, as shown below in Section 4.3. The pseudocode for bootstrapMRI
is provided in Algorithm 2.
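Algorithm 1 SimpleMRI

procedure SimpleMRI($x$, $\theta$, $D_S$, $\theta_2$)
    $D_P \leftarrow D_S$
    $D_T \leftarrow D_S$
    $D_E \leftarrow D_S$
    estimate class probability vector $\hat{\mathbf{p}}$ at $x$
    $\hat{\theta}_2 \leftarrow \theta_2(D_P)$
    $\hat{\mathbf{p}} \leftarrow \phi_2(\hat{\theta}_2, x)$
    estimate future loss vector $\hat{\mathbf{L}}'$
    for $j \in [1 : k]$ do
        $\hat{\theta}_j \leftarrow \theta(D_T \cup (x, c_j))$
        $\hat{L}'_j \leftarrow \frac{1}{|D_E|} \sum_{(x_e, y_e) \in D_E} M_e(\hat{\theta}_j, x_e, y_e)$
    $\hat{T}_e \leftarrow \hat{\mathbf{p}} \cdot \hat{\mathbf{L}}'$
    $\hat{Q}^c \leftarrow \hat{T}_e$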


The labelled dataset $D_S$ is resampled by bootstrapping to form three datasets $D_P$, $D_T$
and $D_E$. These three datasets are independent draws from the ecdf of $D_S$, yielding independent
estimates (Efron, 1983, Chapter 6).

The first dataset $D_P$ provides an estimate for the class probability $\mathbf{p}$, by classifier training
on $D_P$ and class probability prediction on $x$. As before, for $\mathbf{p}$ estimation, a second
classifier $\theta_2$ is used, chosen in the very same way as simpleMRI above. The second and
third datasets $D_T$ and $D_E$ together provide an estimate of the future losses vector $\mathbf{L}'$. Each
element $L'_j$ is estimated by training the base classifier $\theta$ on $D_T \cup (x, c_j)$, then computing
a loss estimate using $D_E$.

In the experimental study of Section 5, the stochastic resampling is repeated $n_b = 25$
times, and the resulting estimates are averaged. Random sub-sampling of the pool is used
for efficiency.
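
The resampling scheme can be sketched in Python as follows. As with any toy illustration, the details are assumptions rather than the study's implementation: a one-dimensional nearest-class-mean model stands in for both classifiers, the loss is the error rate, the repeated estimates are combined by their mean, and all names are hypothetical.

```python
import math
import random
import statistics


def fit_means(data):
    """Train a nearest-class-mean model: per-class sample means of (x, y) pairs."""
    sums, counts = {}, {}
    for x, y in data:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {c: sums[c] / counts[c] for c in sums}


def predict(means, x):
    """Predict the class whose estimated mean is closest to x."""
    return min(means, key=lambda c: abs(x - means[c]))


def predict_proba(means, x, classes):
    """Class probabilities at x, assuming unit-variance Gaussians and equal priors."""
    w = [math.exp(-0.5 * (x - means[c]) ** 2) if c in means else 0.0 for c in classes]
    total = sum(w)
    return [v / total for v in w] if total > 0 else [1.0 / len(classes)] * len(classes)


def error_rate(means, data):
    """Fraction of (x, y) pairs in data that the model misclassifies."""
    return sum(predict(means, x) != y for x, y in data) / len(data)


def resample(data, rng):
    """Bootstrap resample: draw len(data) items with replacement."""
    return [data[rng.randrange(len(data))] for _ in data]


def bootstrap_mri(x, d_s, classes, n_b=25, seed=0):
    """Average, over n_b repeats, the expected post-retraining loss for candidate x."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_b):
        # Three separate resamples for probability, training and evaluation roles.
        d_p, d_t, d_e = (resample(d_s, rng) for _ in range(3))
        p_hat = predict_proba(fit_means(d_p), x, classes)
        loss_hat = [error_rate(fit_means(d_t + [(x, c)]), d_e) for c in classes]
        estimates.append(sum(p * l for p, l in zip(p_hat, loss_hat)))
    return statistics.mean(estimates)


# Illustrative data only.
labelled = [(-1.4, 0), (-0.6, 0), (0.3, 0), (0.8, 1), (1.9, 1)]
estimate = bootstrap_mri(0.0, labelled, [0, 1])
```

Drawing a separate resample for each role is what keeps the probability estimate and the loss estimate from sharing noise; the cost is roughly `n_b` times the work of the naive-reuse version.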


4.3 BootstrapMRI Algorithm Properties 

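BootstrapMRI seeks to minimise estimation bias by generating independent component
estimates, as shown below in Equation 6. Practical $Q^c$ estimation requires calculating only
Term $T_e$ in Equation 2 (Term $T_c$ can be ignored for practical estimation, since Term $T_c \perp\!\!\!\perp x$).
Term $T_e$ is a product of $\mathbf{p}$ and $\mathbf{L}'$, the two components of $Q^c$ to be estimated.

The definitions of unbiased estimation are given below:

• Unbiasedness for $\hat{\mathbf{p}}$ is defined as $(\forall x_i \in \mathbb{R}^d)\; E[\hat{\mathbf{p}}(x_i)] = \mathbf{p}(x_i)$.

• Unbiasedness for $\hat{\mathbf{L}}'$ is defined as $(\forall x_i \in \mathbb{R}^d)\; E[\hat{\mathbf{L}}'(x_i)] = \mathbf{L}'(x_i)$.

• Unbiasedness for $\hat{Q}^c$ is defined as $(\forall x_i \in \mathbb{R}^d)\; E[\hat{Q}^c(x_i, \theta, D_S)] = Q^c(x_i, \theta, D_S)$,

where the expectations are taken over the variability of the estimators.

The independence of $\hat{\mathbf{p}}$ and $\hat{\mathbf{L}}'$ is classical statistical independence: $(\hat{\mathbf{p}} \perp\!\!\!\perp \hat{\mathbf{L}}') \iff [p(\hat{\mathbf{p}} =
\mathbf{a}, \hat{\mathbf{L}}' = \mathbf{b}) = p(\hat{\mathbf{p}} = \mathbf{a})\, p(\hat{\mathbf{L}}' = \mathbf{b})]$, for constant vectors $\mathbf{a}$ and $\mathbf{b}$.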
Algorithm 2 BootstrapMRI

procedure BootstrapMRI($x$, $\theta$, $n_b$, $\theta_2$)
    $q \leftarrow$ zero vector of length $n_b$
    for $b \in [1 : n_b]$ do
        $I_P \leftarrow$ SampleWithReplacement($1 : |D_S|$)
        $I_T \leftarrow$ SampleWithReplacement($1 : |D_S|$)
        $I_E \leftarrow$ SampleWithReplacement($1 : |D_S|$)
        $D_P \leftarrow D_S[I_P]$
        $D_T \leftarrow D_S[I_T]$
        $D_E \leftarrow D_S[I_E]$
        estimate class probability vector $\hat{\mathbf{p}}$ at $x$
        $\hat{\theta}_2 \leftarrow \theta_2(D_P)$
        $\hat{\mathbf{p}} \leftarrow \phi_2(\hat{\theta}_2, x)$
        estimate future loss vector $\hat{\mathbf{L}}'$
        for $j \in [1 : k]$ do
            $\hat{\theta}_j \leftarrow \theta(D_T \cup (x, c_j))$
            $\hat{L}'_j \leftarrow \frac{1}{|D_E|} \sum_{(x_e, y_e) \in D_E} M_e(\hat{\theta}_j, x_e, y_e)$
        $\hat{T}_e \leftarrow \hat{\mathbf{p}} \cdot \hat{\mathbf{L}}'$
        $q[b] \leftarrow \hat{T}_e$
    the final estimate is the average of the estimate vector $q$
    $\hat{Q}^c \leftarrow$ median($q$)


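By generating independent component estimates, bootstrapMRI provides a guarantee:
that if the two component estimates $\hat{\mathbf{p}}$ and $\hat{\mathbf{L}}'$ are both unbiased, then the $Q^c$ estimate will
be unbiased. This is shown by $E[\hat{Q}^c(x)] = Q^c(x)$, since

$$
\begin{aligned}
E[\hat{T}_e] &= E[\hat{\mathbf{p}} \cdot \hat{\mathbf{L}}'] \quad (6) \\
&= E[\hat{\mathbf{p}}] \cdot E[\hat{\mathbf{L}}'] \\
&= \mathbf{p} \cdot \mathbf{L}' \\
&= T_e.
\end{aligned}
$$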
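
The role of independence in Equation 6 can be illustrated with a toy simulation. The scalar values, the Gaussian noise model and the function name below are assumptions made purely for illustration; they are not taken from the experimental study. Two unbiased scalar estimates are multiplied: when they share the same noise (as can happen when the same data is reused), the product is biased upwards by the noise variance, whereas with independent noise the product is unbiased.

```python
import random


def mean_product(p, loss, sigma, n, shared_noise, seed=0):
    """Monte Carlo mean of (p + e_p) * (loss + e_l) for zero-mean Gaussian noise."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        e_p = rng.gauss(0.0, sigma)
        # Shared noise mimics dependent component estimates; otherwise draw afresh.
        e_l = e_p if shared_noise else rng.gauss(0.0, sigma)
        total += (p + e_p) * (loss + e_l)
    return total / n


p, loss, sigma = 0.3, 0.4, 0.2  # illustrative true values; p * loss = 0.12
independent = mean_product(p, loss, sigma, 200_000, shared_noise=False)
dependent = mean_product(p, loss, sigma, 200_000, shared_noise=True)
print(f"independent: {independent:.3f}, dependent: {dependent:.3f}")
```

In this toy setting the dependent case has expectation $p L + \sigma^2 = 0.16$ rather than $0.12$, which is the kind of bias that independent resampling is intended to avoid.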


An ideal scenario would include the Bayes classifier and a large test dataset, providing
the exact probabilities $\mathbf{p}$ and precise, unbiased estimates of $\mathbf{L}'$. In that scenario, the
$Q^c$ estimate will be completely unbiased. In the real application context, neither the Bayes
classifier nor a large test dataset is available, and it is hard to estimate either component
$\mathbf{p}$ or $\mathbf{L}'$ precisely or unbiasedly from a small data sample, these being open research problems
(Acharya et al., 2013; Rodriguez et al., 2013).

Small finite samples do not permit guarantees of unbiased estimation. In practice, the
estimates of $\mathbf{p}$ and $\mathbf{L}'$ will suffer from both imprecision and bias. The bootstrapMRI
algorithm is intended to approach the ideal of unbiased estimation, given the
component estimators available.

For practical approximations to unbiased component estimators, we estimate $\mathbf{p}$ and $\mathbf{L}'$
by the 5-nn classifier and by cross-validation respectively. The classifier $k$-nn has well-known
low asymptotic bounds on its error rate, for continuous covariates and a reasonable distance
metric (Ripley, 1996, Chapter 6). These results suggest that this classifier's probability
estimates should have good statistical properties, such as reasonably low bias in the finite
sample case. The estimation of $\mathbf{L}'$ is nearly unbiased for cross-validation (Efron, 1983).

The class probability vector $\mathbf{p}$ is a component of $Q^c$, which raises a question for
$Q^c$ estimation, of whether $\mathbf{p}$ estimates need to be precise for reasonable $Q^c$ estimation. The
argument of Section 3.3, and the experimental results of bootstrapMRI in Section 5, both
suggest that the $\mathbf{p}$ estimates do not need to be very precise, but should merely have
reasonably low bias.

The computational cost of EfeLc at each selection step is given by $t_a = (t_r + t_p) +
(n_p k (t_r + t_l))$, where $n_p = |X_P|$ is the size of the pool, $k$ is the number of classes, $t_r$ is
the cost of classifier retraining, $t_p$ is the cost of classifier prediction and $t_l$ is the cost of
classifier loss estimation. The cost for simpleMRI is the same as for EfeLc, except that
the $L$-estimation method differs and hence $t_l$ is different. The cost for bootstrapMRI is $n_b$
times that of simpleMRI, where $n_b$ is the number of bootstrap resamples.
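
As a small worked example of this cost formula, the Python sketch below evaluates it for illustrative unit costs. The numbers and function names are assumptions chosen only to show the arithmetic; they are not measurements from the study.

```python
def selection_step_cost(t_r, t_p, t_l, n_p, k):
    """Cost of one selection step: (t_r + t_p) + n_p * k * (t_r + t_l)."""
    return (t_r + t_p) + n_p * k * (t_r + t_l)


def bootstrap_step_cost(t_r, t_p, t_l, n_p, k, n_b):
    """Cost with n_b bootstrap repeats: n_b times the single-pass cost."""
    return n_b * selection_step_cost(t_r, t_p, t_l, n_p, k)


# Illustrative unit costs: retraining 2, prediction 1, loss estimation 3,
# a pool of 100 candidates and 2 classes.
single = selection_step_cost(t_r=2, t_p=1, t_l=3, n_p=100, k=2)   # 3 + 200 * 5
repeated = bootstrap_step_cost(t_r=2, t_p=1, t_l=3, n_p=100, k=2, n_b=25)
```

The second term dominates for any realistic pool size, which is why random sub-sampling of the pool reduces the cost so directly.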


5. Experiments and Results 

A large-scale experimental study explores the performance of the new $Q^c$-estimation AL
methods. The intention is to compare those methods with each other, and to standard AL
methods from the literature (described in Section 2.3). The focus is on the relative classifier
improvements of each AL method, rather than absolute classifier performance.

The base classifier is varied, since AL performance is known to depend substantially on
the classifier (Guyon et al., 2011; Evans et al., 2013). To provide model diversity, the study
uses several classifiers with different capabilities: LDA, 5-nn, naive Bayes, SVM, QDA and
Logistic Regression. The classifiers and their implementations are described in Appendix
C.

Many different classification problems are explored, including real and simulated data,
described in Appendix D. These problems are divided into three problem groups to clarify
the results, see Section 5.4. The experimental study uses error rate for the loss function $L$
(see Section 2.1). Further results are available for another loss function, the H measure, but
are omitted for space (see footnote 1); the choice of loss function does not affect the primary
conclusion of Section 5.5.

The experimental study explores several sources of variation: the AL algorithms, the
classifier $\theta$, and the classification problem $(X, Y)$.


5.1 Active Learning Methods in the Experiment 

The experimental study evaluates many AL methods, to compare their performance across
a range of classification problems. These methods fall into three groups: RS as the natural
benchmark of AL, standard AL methods from the literature, and algorithms estimating
$Q^c$. The second group consists of four standard AL methods: SE, QbcV, QbcA, and EfeLc
(all described in Section 2.3). The third group contains the two $Q^c$-estimation algorithms,
simpleMRI and bootstrapMRI, defined in Section 4 and abbreviated as SMRI and BMRI.


1. For these results see http://www.lewisevans.com/JMLR-Extra-Experimental-Results-Feb-2015.pdf 
For the two Qbc methods, a committee of four classifiers is chosen for model diversity:
logistic regression, 5-nn, 21-nn, and random forest. Random forest is a non-parametric
classifier described in Breiman (2001); the other classifiers are described in Appendix C.
This committee is arbitrary, but diverse; the choices of committee size and constitution are
open research problems.

Density weighting is sometimes recommended in the AL literature, see Olsson (2009).
However, the effects of density weighting are not theoretically understood. The experimental
study also generated results from density weighting, omitted due to space (see footnote 2), which left
unaltered the primary conclusion that the $Q^c$-estimation algorithm bootstrapMRI is very
competitive with standard methods from the literature. The issue of density weighting is
deferred to future work.

5.2 Experimental AL Sandbox 

Iterated AL provides for the exploration of AL performance across the whole learning curve,
see Section 2.2 and Guyon et al. (2011); Evans et al. (2013). In this experimental study, the
AL iteration continues until the entire pool has been labelled. The pool size is chosen such
that when all of the pool has been labelled, the final classifier loss is close to its asymptotic
loss (that asymptotic loss being the loss from training on a much larger dataset). The AL
performance metrics described below examine the entire learning curve.

Each single realisation of the experiment has a specific context: a classification problem,
and a base classifier. The classification data is randomly reshuffled. To examine variation,
multiple Monte Carlo replicates are realised; ten replicates are used for each specific context.

Given this experimental context, the experimental AL sandbox then evaluates the
performance of all AL methods over a single dataset, using iterated AL. Each AL method
produces a learning curve that shows the overall profile of loss as the number of labelled
examples increases. The amount of initial labelled data is chosen to be close to the number
of classes $k$. To illustrate, Figure 5 shows the learning curve for several AL methods, for a
single realisation of the experiment.


5.3 Assessing Performance 


As discussed in Section 2.2, AL performance metrics assess the relative improvements in
classifier performance, when comparing one AL method against another (or when comparing
AL against RS). Thus the real quantity of interest is the ranking of the AL methods.

The AL literature provides a selection of metrics to assess AL performance, such as
AUA (Guyon et al., 2011), WI (Evans et al., 2013) and label complexity (Dasgupta, 2011).
The experimental study evaluates four metrics: AUA, WI with two weighting functions
(exponential with a = 0.02, and linear), and label complexity (with e = 5). Each of these
four metrics is a function of the learning curve, creating a single numeric summary from
the learning curve, this curve being generated by iterated AL.

The overall rank is also calculated as the ranking of the mean ranks, as employed,
for example, by Brazdil and Soares (2000). This yields five AL metrics in total: four
primary metrics (label complexity, AUA, WI-linear, WI-exponential) and one aggregate
metric (overall rank). The overall rank avoids any arbitrary choice of one single metric. In
this experimental study, AL performance is assessed by overall rank, as used in Brazdil and
Soares (2000).


2. For these results see http://www.lewisevans.com/JMLR-Extra-Experimental-Results-Feb-2015.pdf

Figure 5: Result for a single experiment of iterated AL. Each AL method performs multiple
selection steps, generating a set of losses that define the learning curve. For clarity, a
smoothed representation of the data is presented. The early part of the learning curve
is shown. The classification problem is the Four-Gaussian problem (see Figure 6a and
Appendix D), with the base classifier being 5-nn.


For a single experiment, there is a single classification problem and base classifier. In
such an experiment, all five metrics are evaluated for every AL method, so that each metric
produces its own ranking of the AL methods. Since there are seven AL methods (see Section
5.1), the ranks fall between one and seven, with some ties. For brevity, the tables show the
best six methods, chosen by overall rank.

The experimental results show that the AL metrics substantially agree on AL method
ranking (see Tables 2 and 3). This agreement suggests that the results are reasonably
insensitive to the choice of AL metric.


5.4 Aggregate Results 

To address the variability of AL, multiple Monte Carlo draws are conducted for each
classification problem. Thus for each experiment, the labelled, pool and test data are drawn
from the population, as different independent subsamples. This random sampling addresses
two primary sources of variation, namely the initially labelled data and the unlabelled pool.

Table 2: Results for a single pair of classifier and problem, averaged over ten Monte Carlo
replicates. The base classifier is LDA. The classification problem is Australian Credit (see
Appendix D). These six AL methods are the best six, ordered by overall rank (calculated
by numerical averages of ranks). The $Q^c$ algorithms are shown in bold.


Classifier LDA and Single Problem (Australian Credit)

                    BMRI   QbcV   QbcA   EfeLc   SMRI   RS
Overall Rank           1      2      3       4      5    6
Label Complexity       1      2      4       5      6    3
AUA                    1      2      4       3      5    6
WI-Linear              2      1      3       5      4    6
WI-Exponential         1      2      3       5      4    5


The experimental study examines many Monte Carlo draws, classification problems 
in groups, and classifiers. The goal here is to determine the relative performance of the 
AL methods, namely to discover which methods perform better than others, on average 
across the whole experimental study. To that end, the aggregate results are calculated by 
averaging. First the losses are averaged, over Monte Carlo replicates. From those losses, 
AL metrics are calculated, which imply overall rankings. Finally those overall rankings 
are averaged, over classification problems, and then over problem groups, and finally over 
classifiers. 
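
The overall rank (the ranking of the mean ranks) can be reproduced from the metric ranks reported in Table 2. The Python sketch below illustrates only that aggregation step; the function name and the tie handling (ties resolved by order of appearance) are assumptions.

```python
METHODS = ["BMRI", "QbcV", "QbcA", "EfeLc", "SMRI", "RS"]

# Ranks per metric, in the column order of METHODS, as reported in Table 2.
METRIC_RANKS = {
    "Label Complexity": [1, 2, 4, 5, 6, 3],
    "AUA": [1, 2, 4, 3, 5, 6],
    "WI-Linear": [2, 1, 3, 5, 4, 6],
    "WI-Exponential": [1, 2, 3, 5, 4, 5],
}


def overall_rank(methods, metric_ranks):
    """Rank methods by their mean rank across metrics (1 = best)."""
    n_metrics = len(metric_ranks)
    mean_rank = {
        m: sum(ranks[i] for ranks in metric_ranks.values()) / n_metrics
        for i, m in enumerate(methods)
    }
    ordered = sorted(methods, key=mean_rank.get)  # stable sort keeps input order on ties
    return {m: position for position, m in enumerate(ordered, start=1)}


overall = overall_rank(METHODS, METRIC_RANKS)
```

The mean ranks here are 1.25, 1.75, 3.5, 4.5, 4.75 and 5.0, so the resulting order matches the Overall Rank row of Table 2.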

For a single pairing of classifier and problem, there are ten Monte Carlo replicates. 
Consider the true distribution of AL metric scores for each AL method, where the source of 
the variation is the random sampling. The performance of each AL method is encapsulated 
in the score distribution, which is summarised here by the mean. The set of mean scores 
implies a performance ranking of the AL methods. These rankings are then averaged to 
produce a final overall ranking. Integer rankings of the AL methods are shown for clarity.
The frequency with which each AL method outperforms random selection is also of great 
interest, and calculated from the group-classifier rankings. 

To summarise the aggregate result calculations: 


R1 For a single problem-classifier pairing, the average losses are calculated, over the ten
Monte Carlo replicates. This averaging of the losses reduces the variability in the
learning curve. From these average losses, four AL metric numbers are calculated,
leading to five rankings of the AL methods, see Table 2.

R2 For a single group-classifier pairing, the overall rankings of all problem-classifier
pairings in the group are averaged, see Table 3.

R3 For a single classifier, the overall rankings for all three group-classifier pairings are
averaged, see Table 4 (and Tables 9 to 13 in Appendix E).

R4 Finally, the overall rankings for all six classifiers are averaged, see Table

R5 The frequency counts show how often each AL method outperforms RS. These are
calculated from the group-classifier rankings (18 in total), see Table. For example,
Table 4 shows BMRI and SE outperforming RS three times out of three, whereas
QbcA only outperforms RS twice.


Table 3: Results for a single classifier and a group of problems. The base classifier is QDA. 
The classification problem group is the large problem group (see Appendix D). These six 
AL methods are the best six, ordered by overall rank (calculated by numerical averages of 
ranks). The $Q^c$ algorithms are shown in bold. 

Classifier QDA and Single Problem Group (Large Data) 

                    BMRI   SMRI   EfeLc  SE     RS     QbcV
Overall Rank        1      2      3      4      5      6
Label Complexity    5      7      4      6      3      1
AUA                 1      2      3      4      5      6
WI-Linear           2      1      3      4      5      6
WI-Exponential      2      1      3      4      5      6


Table 4: Results for base classifier LDA over three groups of problems. These six AL 
methods are the best six, ordered by overall rank (calculated by numerical averages of 
ranks). The $Q^c$ algorithms are shown in bold. 

Classifier LDA 

Small Problems      QbcV   QbcA   BMRI   SE     SMRI   RS
Large Problems      SE     BMRI   SMRI   QbcA   QbcV   RS
Abstract Problems   BMRI   QbcV   SMRI   SE     RS     QbcA
Average             BMRI   QbcV   SE     SMRI   QbcA   RS

Thus the aggregate results are calculated by averaging over successive levels, one level 
at a time: over Monte Carlo replicates, over problems within a group, over groups, and 
finally over classifiers. This successive averaging is shown by the progression from specific 
realisations to the whole experiment, which starts at the figure showing specific 
realisations and then moves through the tables from Table 2 onwards. 
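
The rank-averaging and frequency-counting steps can be illustrated with a short sketch. The Python fragment below is not the code used in the study; it simply takes the three group-level rankings for LDA as printed in Table 4, averages the rank positions, and counts how often each method is placed above RS. The function names `average_ranks` and `count_better_than_rs` are hypothetical.

```python
from statistics import mean

# Group-level rankings for base classifier LDA, copied from Table 4
# (best method first in each list).
rankings = {
    "Small":    ["QbcV", "QbcA", "BMRI", "SE", "SMRI", "RS"],
    "Large":    ["SE", "BMRI", "SMRI", "QbcA", "QbcV", "RS"],
    "Abstract": ["BMRI", "QbcV", "SMRI", "SE", "RS", "QbcA"],
}


def average_ranks(rankings):
    """Mean rank position (1 = best) of each method across the rankings."""
    methods = next(iter(rankings.values()))
    return {
        m: mean(order.index(m) + 1 for order in rankings.values())
        for m in methods
    }


def count_better_than_rs(rankings, baseline="RS"):
    """Number of rankings in which each method is placed above the baseline."""
    methods = [m for m in next(iter(rankings.values())) if m != baseline]
    return {
        m: sum(order.index(m) < order.index(baseline)
               for order in rankings.values())
        for m in methods
    }


avg = average_ranks(rankings)
overall = sorted(avg, key=avg.get)      # lowest average rank first
wins = count_better_than_rs(rankings)

print(overall)
print(wins)
```

With the Table 4 rankings as input, the sorted order agrees with the Average row of that table, and the counts agree with the example quoted in R5 (BMRI and SE three times out of three, QbcA twice).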

5.5 Results 

The overall performance of the AL methods is summarised by the final ranking, shown in 
Table 5, and the frequency of outperforming RS, given in Table 6. These two tables provide 
the central results for the experimental study. 

Table 5: Final ranking of AL methods, over six classifiers and three groups of problems. 
The $Q^c$ algorithms are shown in bold. 

Final Ranking of AL Methods 

               Rank 1   Rank 2   Rank 3   Rank 4   Rank 5   Rank 6   Rank 7
Average Rank   BMRI     SE       QbcV     QbcA     RS       SMRI     EfeLc


Table 6: Frequency of outperforming RS, for six classifiers over three groups of problems. 
The count shows the number of times that each AL method outperforms RS, across the 
group-classifier pairings (18 in total). The count falls in the range [0,18]. The $Q^c$ 
algorithms are shown in bold. 

Frequency of Outperforming Random Selection 

                       Rank 1   Rank 2   Rank 3   Rank 4   Rank 5   Rank 6
Method                 BMRI     SE       QbcV     SMRI     QbcA     EfeLc
Count better than RS   15       14       13       9        8        2


The primary conclusion is that the $Q^c$-motivated bootstrapMRI algorithm performs well 
in comparison to the standard AL methods from the literature. This conclusion holds true 
over different classifiers and different classification problems. 

Table 6 suggests that just three methods consistently outperform RS: bootstrapMRI, 
SE and QbcV. BootstrapMRI outperforms RS fifteen times out of eighteen. This provides 
experimental confirmation for the argument that unbiased estimation algorithms 
consistently outperform RS, given in Section 3.3. 

Comparing the $Q^c$-estimation algorithms against each other, the algorithm bootstrapMRI 
outperforms the algorithm simpleMRI in all cases except two. This suggests that minimising 
bias in estimation may be important for AL performance. 

Examining the AL methods from the literature, QBC and SE consistently perform well. 
For QBC, vote entropy (QbcV) mostly outperforms average Kullback-Leibler divergence 
(QbcA). EfeLc performs somewhat less well, perhaps because of the way it approximates 
loss using the unlabelled pool (see Section 2.3). For most classifiers, RS performs poorly, 
with many AL methods providing a clear benefit; SVM is the exception here, where RS 
performs best overall. 

3. For further details see http://www.lewisevans.com/JMLR-Extra-Experimental-Results-Feb-2015.pdf 

The detailed results for each individual classifier are given in Appendix E. 

An earlier section describes the difficulty of estimating the components, p and L', from 
small data samples. For the practical algorithms, the estimates of p and L' will suffer 
from imprecision and bias. The experimental results show that despite these estimation 
difficulties, strong AL performance can still be achieved. 

6. Conclusion 

Model retraining improvement is a novel statistical framework for AL, which characterises 
optimal behaviour via classifier loss. This approach is both theoretical and practical, giving 
new insights into AL, and competitive AL algorithms for applications. 

The MRI statistical estimation framework begins with the targets $Q^c$ and $B^c$. These 
quantities define optimal AL behaviour for the contexts of pool-based AL: individual and 
batch, single-step and iterated. 

Exploring the abstract definition of optimal AL behaviour generates new insights into 
AL. For a particular abstract problem, the optimal selection is examined and compared to 
known AL methods, revealing exactly how heuristics can make suboptimal choices. The 
framework is used to show that an unbiased estimator will outperform random selection, 
bringing a new guarantee for AL. 

The MRI framework motivates the construction of new algorithms to estimate $Q^c$. A 
comprehensive experimental study compares the performance of $Q^c$-estimation algorithms 
to several standard AL methods. The results demonstrate that bootstrapMRI is strongly 
competitive across a range of classifiers and problems, and is recommended for practical 
use. 

There are many more statistical choices for $Q^c$-estimation algorithms. These choices 
include various methods to estimate the class probability p (e.g. via the base classifier, 
a different classifier, or a classifier committee); different methods to estimate the loss 
(e.g. in-sample, cross-validation, bootstrap, or via the unlabelled pool); and many further 
ways to use the data (e.g. full reuse, resampling, or partitioning). More sophisticated 
estimators are the subject of future research, and hopefully MRI will motivate others to 
develop superior estimators. 

The estimation framework enables reasoning about AL consistency behaviour and stopping 
rules, which are the subject of future work. The MRI framework opens the door to 
potential statistical explanations of AL heuristics such as SE and QBC, whose experimental 
effectiveness may otherwise remain mysterious. 
Appendix A. 

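This Appendix shows that given a zero-mean univariate Gaussian RV denoted $N$ with 
variance $\sigma^2$, a positive constant $\delta$, a fixed-sized interval $[0, \delta)$, and the 
probability $\beta = p(0 < N < \delta)$, then as $\sigma^2 \uparrow \infty$, $\beta \downarrow 0$. 

$N$ is Gaussian with mean zero, hence it has cdf 
$F_N(x) = \frac{1}{2}\left(1 + \mathrm{erf}\left(\frac{x}{\sigma\sqrt{2}}\right)\right)$, where 
$\mathrm{erf}(y) = \frac{1}{\sqrt{\pi}} \int_{-y}^{y} e^{-t^2}\, dt$. By definition 

\[
\begin{aligned}
\beta &= p(0 < N < \delta) \\
      &= F_N(\delta) - F_N(0) \\
      &= F_N(\delta) - \tfrac{1}{2} \\
      &= \tfrac{1}{2}\,\mathrm{erf}\!\left(\frac{\delta}{\sigma\sqrt{2}}\right) \\
      &= \frac{1}{2}\,\frac{1}{\sqrt{\pi}} \int_{-\delta/(\sigma\sqrt{2})}^{\delta/(\sigma\sqrt{2})} e^{-t^2}\, dt .
\end{aligned}
\]

As $\sigma^2 \uparrow \infty$, the integration limits $\pm\delta/(\sigma\sqrt{2})$ shrink to zero, so 

\[
\beta \to \frac{1}{2}\,\frac{1}{\sqrt{\pi}} \int_{0}^{0} e^{-t^2}\, dt = 0 .
\]

The integrand is positive, so $\beta$ decreases monotonically as the limits shrink. 
Hence as $\sigma^2 \uparrow \infty$, $\beta \downarrow 0$. 

The above argument applies to a RV $N$ and a fixed interval $[0, \delta)$, but also applies to 
a random interval $[0, \Delta)$ with $\Delta$ being a strictly positive RV, since the argument 
applies to every realisation of $\Delta$. 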
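
As a numerical illustration of this limit (a sketch only, not part of the original argument), the closed form $\beta = \frac{1}{2}\,\mathrm{erf}(\delta/(\sigma\sqrt{2}))$ can be evaluated for increasing $\sigma$. The value $\delta = 1$ and the grid of $\sigma$ values are arbitrary choices made for this example.

```python
import math


def beta(sigma, delta):
    """p(0 < N < delta) for N ~ Normal(0, sigma^2), via the error function."""
    return 0.5 * math.erf(delta / (sigma * math.sqrt(2.0)))


delta = 1.0                                   # arbitrary interval width
sigmas = [0.5, 1.0, 10.0, 100.0, 1000.0]      # increasing standard deviation
betas = [beta(s, delta) for s in sigmas]

for s, b in zip(sigmas, betas):
    print(f"sigma = {s:8.1f}   beta = {b:.6f}")
```

The printed values of beta shrink towards zero as sigma grows, in line with the statement above.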


Appendix B. 

This Appendix shows that given a zero-mean univariate Gaussian RV denoted $N$ with 
variance $\sigma^2$, a positive constant $\delta$, a fixed-sized interval $[0, \delta)$, and the 
probability $\beta = p(0 < N < \delta)$, then as $\sigma^2 \downarrow 0$, $\beta \uparrow \frac{1}{2}$. 

By definition, 

\[ p(N > 0) = p(0 < N < \delta) + p(N > \delta), \]

i.e. 

\[ \tfrac{1}{2} = \beta + p(N > \delta), \]

giving 

\[ \beta = \tfrac{1}{2} - p(N > \delta). \]

By definition 

\[ p(N > \delta) < p(|N| > \delta), \]

and Chebyshev's Inequality gives 

\[ p(|N| > \delta) \leq \frac{\sigma^2}{\delta^2}, \]

hence 

\[ p(N > \delta) < \frac{\sigma^2}{\delta^2}. \]

As $\sigma^2 \downarrow 0$, $\sigma^2/\delta^2 \downarrow 0$, hence $p(N > \delta) \downarrow 0$. Combining this with 
$\beta = \frac{1}{2} - p(N > \delta)$ yields $\beta \uparrow \frac{1}{2}$ as $\sigma^2 \downarrow 0$. 

The above argument applies to a RV $N$ and a fixed interval $[0, \delta)$, but also applies to 
a random interval $[0, \Delta)$ with $\Delta$ being a strictly positive RV, since the argument 
applies to every realisation of $\Delta$. 
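
The Chebyshev step can be checked numerically. The sketch below is an illustration under arbitrary choices ($\delta = 1$ and a small grid of $\sigma$ values), not part of the original proof: it compares the exact Gaussian tail $p(N > \delta)$ with the bound $\sigma^2/\delta^2$ and reports the resulting $\beta$.

```python
import math


def upper_tail(sigma, delta):
    """Exact p(N > delta) for N ~ Normal(0, sigma^2)."""
    return 0.5 * math.erfc(delta / (sigma * math.sqrt(2.0)))


delta = 1.0                      # arbitrary interval width
rows = []
for sigma in [2.0, 1.0, 0.5, 0.1]:           # decreasing standard deviation
    tail = upper_tail(sigma, delta)
    bound = sigma ** 2 / delta ** 2          # Chebyshev bound on p(|N| > delta)
    beta = 0.5 - tail                        # beta = 1/2 - p(N > delta)
    rows.append((sigma, tail, bound, beta))

for sigma, tail, bound, beta in rows:
    print(f"sigma = {sigma:4.1f}  tail = {tail:.3e}  bound = {bound:.3e}  beta = {beta:.6f}")
```

For large $\sigma$ the bound exceeds one and is uninformative, but it still holds; as $\sigma$ shrinks, both the tail and the bound fall towards zero and beta approaches one half.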


Appendix C. 


This Appendix describes the six classifiers used in the experimental study of Section 5 and 
their implementation details. 

Section 5 describes experiments with six classifiers: linear discriminant analysis, quadratic 
discriminant analysis, K-nearest-neighbours, naive Bayes, logistic regression and support 
vector machine. Linear discriminant analysis (LDA) is a linear generative classifier described 
in Hastie et al. (2009, Chapter 4). Quadratic discriminant analysis (QDA) is a non-linear 
generative classifier described in Hastie et al. (2009, Chapter 4). K-Nearest-Neighbours 
(K-nn) is a well-known non-parametric classifier discussed in Duda et al. (2001, Chapter 4). 
Naive Bayes is a probabilistic classifier which assumes independence of the covariates, given 
the class; see Hand and Yu (2001). Logistic Regression is a linear parametric discriminative 
classifier described in Hastie et al. (2009, Chapter 4). The support vector machine (SVM) 
is a popular non-parametric classifier described in Cortes and Vapnik (1995). Standard R 
implementations are used for these classifiers, see below. 

The classifier implementation details are now described. For LDA and QDA, the standard R 
implementations are used. For 5-nn, the R implementation from package kknn is used (see 
footnote 4). This implementation applies covariate scaling: each covariate is scaled to have 
equal standard deviation (using the same scaling for both training and testing data). For 
naive Bayes, the R implementation from package e1071 is used (see footnote 5). For 
continuous predictors, a Gaussian distribution (given the target class) is assumed. This 
approach is less than ideal, but tangential to the statistical estimation framework and 
experimental study. For Logistic Regression, the Weka implementation from package RWeka 
is used (see footnote 6). For SVM, the R implementation from package e1071 is used. The 
SVM kernel used is the radial basis kernel. The probability calibration of the scores is 
performed for binary problems by MLE fitting of a logistic distribution to the decision 
values, or for multi-class problems, by computing the a-posteriori class probabilities using 
quadratic optimisation. 

4. For details see http://cran.r-project.org/web/packages/kknn/kknn.pdf 

5. For details see http://cran.r-project.org/web/packages/e1071/e1071.pdf 

6. For details see http://cran.r-project.org/web/packages/RWeka/RWeka.pdf 
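
The covariate scaling described for 5-nn can be illustrated with a small sketch. This is not the kknn implementation; it is a minimal Python fragment that assumes scaling by the population standard deviation estimated on the training data, and the helper names `fit_scale` and `apply_scale` are hypothetical. The point it shows is that the scale factors are learned once and then reused unchanged for the test data.

```python
from statistics import pstdev


def fit_scale(train_rows):
    """Per-covariate standard deviations, estimated from training data only."""
    columns = list(zip(*train_rows))
    # Guard against constant covariates, which would give a zero divisor.
    return [pstdev(col) or 1.0 for col in columns]


def apply_scale(rows, scales):
    """Divide every covariate by its (training) standard deviation."""
    return [[x / s for x, s in zip(row, scales)] for row in rows]


# Toy data: two covariates on very different scales.
train = [[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]]
test = [[2.5, 200.0]]

scales = fit_scale(train)
train_scaled = apply_scale(train, scales)
test_scaled = apply_scale(test, scales)      # same scaling as the training data

print(scales)
print(train_scaled)
print(test_scaled)
```

After scaling, each training covariate has unit standard deviation, so neither covariate dominates a distance computation purely because of its units.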


Appendix D. 

A diverse set of classification problems is chosen to explore AL performance. The 
classification problems fall into two sets: real problems and abstract problems. 

First the real data classification problems are shown in Tables 7 and 8. The real data 
problems are split into two groups, one for smaller problems of fewer examples, and another 
of larger problems. The class prior is shown, since the experimental study uses error rate 
as loss. The sources for this data include UCI (Bache and Lichman, 2013), Guyon et al. 
(2011), Anagnostopoulos et al. (2012) and Adams et al. (2010). 

The intention here is to provide a wide variety in terms of problem properties: covariate 
dimension $d$, number of classes $k$, the class prior $\pi$, and the underlying distribution. The 
number and variety of problems suggests that the results in Section 5.5 have low sensitivity 
to the presence or absence of one or two specific problems. 


Table 7: Real Data Classification Problems, Smaller 

Name            Dim. d   Classes k   Cases n   Class Prior $\pi$
Australian      14       2           690       (0.44, 0.56)
Balance         4        3           625       (0.08, 0.46, 0.46)
Glass           10       6           214       (0.33, 0.36, 0.08, 0.06, 0.04, 0.14)
Heart-Statlog   13       2           270       (0.56, 0.44)
Monks-1         6        2           432       (0.5, 0.5)
Monks-2         6        2           432       (0.5, 0.5)
Monks-3         6        2           432       (0.5, 0.5)
Pima Diabetes   8        2           768       (0.35, 0.65)
Sonar           60       2           208       (0.47, 0.53)
Wine            13       3           178       (0.33, 0.4, 0.27)


Table 8: Real Data Classification Problems, Larger 

Name                 Dim. d   Classes k   Cases n   Class Prior $\pi$
Fraud                20       2           5999      (0.167, 0.833)
Electricity Prices   6        2           27552     (0.585, 0.415)
Colon                16       2           17076     (0.406, 0.594)
Credit 93            29       2           4406      (0.007, 0.993)
Credit 94            29       2           8493      (0.091, 0.909)
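
One way to read the class priors alongside an error-rate loss is through a trivial baseline. The sketch below is an illustration added here, not a calculation from the study: it assumes a classifier that always predicts the most common class, whose error rate is then one minus the largest prior. The priors are those listed in Table 8; the function name `majority_class_error` is hypothetical.

```python
# Class priors of the larger real data problems, as listed in Table 8.
priors = {
    "Fraud": (0.167, 0.833),
    "Electricity Prices": (0.585, 0.415),
    "Colon": (0.406, 0.594),
    "Credit 93": (0.007, 0.993),
    "Credit 94": (0.091, 0.909),
}


def majority_class_error(prior):
    """Error rate of always predicting the most probable class."""
    return 1.0 - max(prior)


baseline = {name: round(majority_class_error(p), 3) for name, p in priors.items()}

for name, err in baseline.items():
    print(f"{name:20s} baseline error = {err:.3f}")
```

For a heavily imbalanced problem such as Credit 93, this baseline error is already very small, which is worth keeping in mind when comparing error rates across problems.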


[Figure 6 panels: (a) Four-Gaussian, see (Ripley 1996); (b) Gaussian, quadratic boundary; 
(c) Triangles of three Gaussians; (d) Gaussian sets, oscillating boundary; (e) Gaussian, 
sharply non-linear boundary] 

Figure 6: Density contour plots showing the abstract classification problems. The 
class-conditional distributions $(X \mid Y = c_j)$ are shown in red for class 1 and blue for 
class 2. These class-conditional distributions are either Gaussians or mixtures of Gaussians. 
The decision boundary is shown in green. 

Second the abstract classification problems are illustrated in Figure 6. These abstract 
problems are generated by sampling from known probability distributions. The 
class-conditional distributions $(X \mid Y = c_j)$ are either Gaussians or mixtures of Gaussians. 
This set of problems presents a variety of decision boundaries to the classifier. All have 
balanced uniform priors, and the Bayes Error Rates are approximately 0.1. 
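
To make the idea of an abstract problem concrete, the following sketch generates a much simpler problem than those in the study. It is a hypothetical one-dimensional example, chosen for this illustration only: two unit-variance Gaussian classes centred at -MU and +MU with a balanced prior, where MU is picked so that the analytic Bayes error rate is close to 0.1. All names in the code are assumptions.

```python
import math
import random


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Hypothetical problem: class means at -MU and +MU, unit variance.
MU = 1.2816


def sample(n, rng):
    """Draw n labelled examples (x, y) with a balanced class prior."""
    data = []
    for _ in range(n):
        y = rng.randrange(2)                  # classes 0 and 1, equally likely
        centre = MU if y == 1 else -MU
        data.append((rng.gauss(centre, 1.0), y))
    return data


def bayes_rule(x):
    """Optimal decision for this symmetric problem: threshold at zero."""
    return 1 if x > 0.0 else 0


# Analytic Bayes error: probability that a class-1 draw falls below zero.
bayes_error = normal_cdf(-MU)

rng = random.Random(0)
draws = sample(20000, rng)
empirical_error = sum(bayes_rule(x) != y for x, y in draws) / len(draws)

print(f"analytic Bayes error  = {bayes_error:.4f}")
print(f"empirical error (MC)  = {empirical_error:.4f}")
```

Because the generating distributions are known, the Bayes error rate is available in closed form here, and the Monte Carlo estimate should sit close to it.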

Appendix E. 

This Appendix shows the results for each individual classifier in the experimental study 
described in Section 5. The results for LDA, K-nn, naive Bayes, SVM, QDA and Logistic 
Regression are shown in Tables 4, 9, 10, 11, 12 and 13 respectively. These results are 
the detailed results of the experimental study, covering the six classifiers, all the problems 
in three groups, and multiple Monte Carlo replicates. In each table, the average rank is 
calculated as the numerical mean, with ties resolved by preferring lower variance of the rank 
vector. 
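
The tie-breaking rule can be illustrated with the 5-nn rankings from Table 9, where RS and QbcA share the same mean rank. The sketch below is an illustration, not the study's code; it assumes the population variance (the text does not say whether population or sample variance is meant, and either gives the same ordering here), and the function names are hypothetical.

```python
from statistics import mean, pvariance

# Group-level rankings for base classifier 5-nn, copied from Table 9
# (best method first in each list).
rankings_5nn = [
    ["SE", "BMRI", "QbcV", "SMRI", "RS", "QbcA"],    # small problems
    ["QbcA", "BMRI", "SE", "QbcV", "RS", "SMRI"],    # large problems
    ["BMRI", "SE", "RS", "QbcV", "SMRI", "QbcA"],    # abstract problems
]


def rank_vector(method, rankings):
    """Rank position (1 = best) of a method in each ranking."""
    return [order.index(method) + 1 for order in rankings]


def overall_order(rankings):
    """Sort by mean rank, breaking ties by lower variance of the rank vector."""
    def key(method):
        v = rank_vector(method, rankings)
        return (mean(v), pvariance(v))
    return sorted(rankings[0], key=key)


print(rank_vector("RS", rankings_5nn), rank_vector("QbcA", rankings_5nn))
print(overall_order(rankings_5nn))
```

RS has rank vector (5, 5, 3) and QbcA has (6, 1, 6); both average 13/3, and the lower variance places RS ahead of QbcA, matching the Average row of Table 9.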

Table 9: Results for base classifier 5-nn over three groups of problems. These six AL 
methods are the best six, ordered by overall rank (calculated by numerical averages of 
ranks). The $Q^c$ algorithms are shown in bold. 

Classifier 5-nn 

Small Problems      SE     BMRI   QbcV   SMRI   RS     QbcA
Large Problems      QbcA   BMRI   SE     QbcV   RS     SMRI
Abstract Problems   BMRI   SE     RS     QbcV   SMRI   QbcA
Average             BMRI   SE     QbcV   RS     QbcA   SMRI


The results of Section 5.5 quantify the benefit of AL over RS: the rankings reported there 
show how much AL methods outperform RS. Another way to quantify AL benefit is 
provided by the regret difference between an AL method and RS. Here AL regret is naturally 
defined as the loss difference between the optimal performance, given by maximising $Q^c$, 
and the actual performance of any given AL method. Another aspect of AL benefit is the 
question of where AL outperforms RS, and this aspect is quantified by the frequency results 
in Table 6. 
Table 10: Results for base classifier naive Bayes over three groups of problems. These six 
AL methods are the best six, ordered by overall rank (calculated by numerical averages of 
ranks). The $Q^c$ algorithms are shown in bold. 


Classifier naive Bayes 

Small Problems      SE     BMRI   QbcV   QbcA   RS     SMRI
Large Problems      QbcV   SE     EfeLc  BMRI   SMRI   QbcA
Abstract Problems   SE     QbcV   BMRI   SMRI   RS     QbcA
Average             SE     QbcV   BMRI   SMRI   QbcA   RS


Table 11: Results for base classifier SVM over three groups of problems. These six AL 
methods are the best six, ordered by overall rank (calculated by numerical averages of 
ranks). The $Q^c$ algorithms are shown in bold. 

Classifier SVM 

Small Problems      QbcV   RS     QbcA   SE     BMRI   SMRI
Large Problems      RS     QbcA   QbcV   EfeLc  SMRI   BMRI
Abstract Problems   QbcV   RS     QbcA   BMRI   SMRI   SE
Average             RS     QbcV   QbcA   BMRI   SMRI   SE


Table 12: Results for base classifier QDA over three groups of problems. These six AL 
methods are the best six, ordered by overall rank (calculated by numerical averages of 
ranks). The $Q^c$ algorithms are shown in bold. 

Classifier QDA 

Small Problems      SE     BMRI   QbcV   QbcA   SMRI   RS
Large Problems      BMRI   SMRI   EfeLc  SE     RS     QbcV
Abstract Problems   SE     BMRI   RS     QbcV   QbcA   SMRI
Average             BMRI   SE     QbcV   SMRI   RS     QbcA


Table 13: Results for base classifier Logistic Regression over three groups of problems. 
These six AL methods are the best six, ordered by overall rank (calculated by numerical 
averages of ranks). The $Q^c$ algorithms are shown in bold. 

Classifier Logistic Regression 

Small Problems      QbcV   QbcA   BMRI   SE     RS     SMRI
Large Problems      SE     QbcV   QbcA   SMRI   BMRI   RS
Abstract Problems   BMRI   RS     SE     SMRI   QbcV   QbcA
Average             QbcV   SE     BMRI   QbcA   RS     SMRI